¹¹institutetext: University of Mannheim, Schloss, 68161 Mannheim, Germany
¹¹email: {[email protected], [email protected], [email protected]}

Using LLMs for the Extraction and Normalization of Product Attribute Values

Nick Baumann 11 0009-0001-1215-9153 Alexander Brinkmann 11 0000-0002-9379-2048 Christian Bizer 11 0000-0003-2367-0237

Abstract

Product offers on e-commerce websites often consist of a textual product title and a textual product description. In order to provide features such as faceted product filtering or content-based product recommendation, the websites need to extract attribute-value pairs from the unstructured product descriptions. This paper explores the potential of using large language models (LLMs), such as OpenAI’s GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and product descriptions. For our experiments, we introduce the WDC Product Attribute-Value Extraction (WDC PAVE) dataset. WDC PAVE consists of product offers from 87 websites that provide schema.org annotations. The offers belong to five different categories, each featuring a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement normalization, and string wrangling. Our experiments demonstrate that GPT-4 outperforms PLM-based extraction methods by 10%, achieving an F1-Score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.

Keywords:

Information Extraction Value Normalization LLMs

1 Introduction

The goal of product attribute value extraction is to identify attribute-value pairs in product descriptions. To enable features like faceted product filtering and product recommendation, the extracted attribute values must be normalized to be represented on single, attribute-specific scales. Figure 1 shows an example of a product offer together with attribute-value pairs that have been extracted from the product title. The attribute values are shown as directly extracted values and on the right as normalized values. Existing methods for product attribute value extraction often require large amounts of domain-specific training data to learn extraction rules [3, 8, 13], label attribute value sequences [5, 15, 19] or extract attribute values using question answering [10, 16]. The methods focus on extracting sequences of tokens that form attribute values but do not cover the normalization of the extracted values, despite value normalization being required in many use cases [4].

Refer to caption — Figure 1: Product offer with extracted and normalized attribute-value pairs.

Motivated by the success of large language models (LLMs) in related NLP tasks [12, 18] and other information extraction use cases [1, 2, 6, 9, 11], this paper explores the potential of LLMs for product attribute-value extraction for two scenarios: (i) direct extraction and (ii) extraction with normalization. In the direct extraction scenario, the objective is to extract the exact surface form of attribute values from product titles and descriptions [10, 14, 15, 16]. In the extraction with normalization scenario, the goal is to extract and normalize attribute-value pairs. As current benchmarks for extracting product attribute values are solely designed for measuring extraction quality and do not cover the normalization of the extracted data [16, 14, 17], we introduce a new dataset, WDC Product Attribute-Value Extraction (WDC PAVE). WDC PAVE consists of 1,420 product offers that have been annotated by 87 websites using the schema.org vocabulary. The 24,583 attribute-value pairs have been verified manually. The attribute-value annotations are available in two formats: (i) extraction only and (ii) extraction with normalization. To normalize the values, the systems need to perform name expansion, generalization, unit of measurement conversion, and string wrangling. In summary, we make the following contributions:

1.

We propose prompt templates for instructing LLMs to extract and normalize attribute-value pairs from product titles and descriptions. The templates cover use cases with and without training data.
2.

We introduce the new dataset WDC PAVE containing 1,420 heterogeneous product offers and 24,583 manually verified attribute-value pairs, which are prepared for the two scenarios: (i) extraction and (ii) extraction with normalization. In contrast to existing benchmarks [16, 14, 17], WDC PAVE covers value extraction and value normalization.
3.

We experimentally compare the extraction performance of GPT-3.5 and GPT-4 and PLM-based methods on WDC PAVE. GPT-4 achieves the overall best results with an F1-score of 91% and outperforms PLM-based methods by 10%.
4.

We experiment with extracting and normalizing attribute values using GPT-3.5 and GPT-4. When using training data for example value and demonstration selection, GPT-4 performs particularly well for string wrangling and name expansion with F1-scores of 95% and 98% respectively. In contrast, existing work on attribute value normalization using LLMs [4] does not consider the attribute value context for the normalization.

This paper is structured as follows: Section 2 introduces the new dataset WDC PAVE. Section 3 describes the experimental setup. Section 4 and Section 5 discuss the results of the experiments for both scenarios: (i) extraction and (ii) extraction and normalization of product attribute values. The dataset WDC PAVE and the code for replicating all our experiments are available online¹¹1https://github.com/wbsg-uni-mannheim/wdc-pave.

2 WDC Product Attribute-Value Extraction

This section introduces the WDC PAVE dataset. First, we describe the collection of product offers and attribute-value pairs using schema.org annotations and specification tables. Second, the statistics of WDC PAVE are profiled. Third, the normalization operations name expansion, generalization, unit of measurement normalization, and string wrangling are introduced.

Data collection. The WDC PAVE dataset is derived from the Web Data Commons Large-Scale Product Matching (WDC LSPM) corpus²²2https://webdatacommons.org/largescaleproductcorpus/v2/ [7]. WDC LSPM features over 26 million product offers originating from 79 thousand websites using the schema.org vocabulary. The offers are categorized into 26 product categories. In addition to schema.org annotations, WDC LSPM extracts attribute-value pairs from specification tables found on the web pages. For WDC PAVE, we clean the product offers and attribute-value pairs, omitting those missing titles, descriptions, or specification tables, and those with descriptions exceeding 1,000 characters. Additionally, HTML and language tags are stripped away, and only product texts in English are preserved. We select the four product categories ’Computers and Accessories’, ’Jewelry’, ’Grocery and Gourmet Food’, ’Office Products’, and ’Home and Garden’ as they contain a large number of product offers and attribute-value pairs after pre-processing. Next, a fixed set of attributes per category is identified for a random sample and all attribute-value pairs are manually verified. The manual verification ensures that all product offers are annotated with every attribute relevant to its product category. Each attribute value is a sub-string of the offer’s title or description. If an attribute is not mentioned in the title or description, it is annotated with ’n/a’.

Dataset Statistics. Table 1 shows statistics for our dataset. It comprises 24,583 attribute-value pairs from 1,420 product offers extracted from 87 unique host websites. 51.15% of the attribute values are associated with a value in either the title, description, or both, while 48.85% of the product offers do not have their attributes explicitly mentioned in the title or description. On average, each product offer has 6.51 attributes associated with a value captured in the product title or description, and 8.91 attributes are ’n/a’. On average, there are about 70 unique values for each attribute, with a broad range observed across different attributes. For comprehensive statistics, please visit the project website³³3https://webdatacommons.org/structureddata/wdc-pave/.

Table 1: Statistics for WDC PAVE

	Home &	Computers &	Grocery &	Office
Kategorie	Garden	Accessories	Gourmet	Produkte	Jewelry	Overall
Uniq. Attributes	20	15	8	18	9	70
Attribute-Value Pairs	7,854	7,492	814	5,942	2,481	24,583
Uniq. Values	6,397	4,137	1,108	6,110	991	18,743
Product Offers	356	436	81	297	250	1420
Host Websites	19	13	20	12	23	87

Data Normalization. Among the 70 attributes in WDC PAVE, we identified 37 attributes for which the attribute values should be normalized before being used within applications like faceted product search. We have identified four normalization operations: name expansion, generalization, unit of measurement conversions, and string wrangling. Table 2 illustrates the normalization operations with examples for selected attributes. Each attribute requiring normalization is allocated to one of the normalization operations. The attribute values are manually normalized. Both the extracted attribute values and the extracted and normalized attribute values are available in our code repository.

Table 2: Overview of attribute value normalization operations in WDC PAVE

Operation

Attributes

Examples

Name

Expansion

Manufacturer,

Generation,

Capacity, Cache

"HP"

\rightarrow

"Hewlett-Packard"

"PII"

\rightarrow

"Pentium II"

"G1"

\rightarrow

"Generation 1"

Generalization

Product Type,

Color, Processor

Typ

"Oatmeal"

\rightarrow

"Snacks and Breakfast"

"Sparkling Juices"

\rightarrow

"Beverages"

"Neon Lime Green"

\rightarrow

"Green"

"Canary"

\rightarrow

"Yellow"

Unit of

Measurement

Normalization

Dimensions,

Paper Weight,

Size/Weight,

Rotational Speed,

Pack Quantity

"7""

\rightarrow

"17.8"

"164 ft"

\rightarrow

"4998.7"

"20-lb."

\rightarrow

"9.06" (kg)

"0.31 oz"

\rightarrow

"879" (g)

"10k"

\rightarrow

"10000"

String

Wrangling

Identifiers, Ports

Processor Core,

Retail UPC,

Brand

"CTW-4M(208)"

\rightarrow

"CTW4M208"

"Dual Port"

\rightarrow

"2"

"4-Core"

\rightarrow

"4"

"Quaker Foods"

\rightarrow

"QUAKER FOODS"

3 Experimental Setup

For our experiments, we split WDC PAVE into a training and a test set with a ratio of 75:25, stratified by product category. A random subset of 20% of the training records per product category is used to select in-context demonstrations and example values to simulate a scenario with a few labeled attribute-value pairs. Attributes that do not require value normalization are removed from both the training and test sets. We access GPT-3.5-turbo-0613 referred to as GPT-3.5 and GPT-4-0613 referred to as GPT-4 through the OpenAI API. The temperature parameter of the LLMs is set to 0 to reduce the randomness. As baselines for the experiments, we finetune the PLM-based extraction methods SU-OpenTag [14], AVEQA [10] and MAVEQA [16] on the training set. For the evaluation, we follow related work and calculate F1-scores based on the exact match between the predicted and the ground truth attribute values [2, 10, 14, 15, 16].

4 Extraction

In this section, we evaluate prompt templates for extracting attribute-value pairs from product offers in WDC PAVE. Following Brinkmann et al. [2], the prompts ask the LLM to extract all attributes in the target schema in a single step. Each template consists of up to six building blocks that are shown in Figure 2. The role description (blue) defines the overall goal of the LLM as well as the target schema for the extraction including attribute descriptions and example values. The target schema is encoded as a JSON-schema because this this representation proved to be most successful in related work [2]. The task description (blue) contains instructions for the attribute-value extraction. The task input (green) consists of the product title and the product description. The task output (orange) contains the LLM’s extracted attribute-value pairs in JSON format. The demonstration task input (green) and output (yellow) contain demonstrations with product offers that are semantically similar to the target product offer and are selected from the training set. Figure 2 contains the chat message building blocks for both scenarios: extraction and extraction with normalization. The red text is added to the prompts for the extraction and normalization scenario.

We assess the impact of zero-shot and few-shot prompt template configurations on the performance of GPT-3.5 and GPT-4. Table 3 shows the F1-scores of the prompt template configurations with different amounts of example values (val.) and the combination of 10 example values and different amounts of demonstrations (dem.). The highest F1-score with and without demonstrations is marked in bold. The results show that both GPT-3.5 and GPT-4 benefit from ten example values, achieving F1-scores of around 80%. Demonstrations further improve the performance of GPT-3.5 and GPT-4. Ten demonstrations allow GPT-4 to achieve the highest F1-score of 91%. We compare the performance of GPT-3.5 and GPT-4 with the PLM-based baselines SU-OpenTag [14], AVEQA [10] and MAVEQA [16] fine-tuned on the same training set that is used to select example values and task demonstrations. The GPT-4 using the 10-dem prompt template outperforms the best PLM-based baseline AVEQA by 10%.

Table 3: F1-scores for the extraction scenario.

					10 Example Values &			PLM Baselines
LLM	0 val.	3 val.	5 val.	10 val.	3 dem.	5 dem.	10 dem.	SU-OpenTag	60.44
GPT-3.5	70.61	76.46	77.06	79.37	87.17	87.42	87.91	AVEQA	80.83
GPT-4	74.40	78.57	78.96	80.70	88.94	88.87	90.54	MAVEQA	65.10

5 Extraction with Normalization

This section evaluates prompt templates for the scenario in which attribute values are extracted from product offers and normalized to be comparable to other attribute values of the same attribute. The prompt template in Figure 2 is extended by the text in red. The extensions add normalization guidelines to the task description, mappings of example values to their normalized forms to the target schema in the role description, and use normalized values in the task output of the demonstrations.

As in Section 4, we assess how zero, three, five and ten example values without demonstrations and ten example values with three, five and ten demonstrations selected from the training set impact the performance of GPT-3.5 and GPT-4. Table 4 shows the F1-scores of the experiments. The highest F1-scores with and without demonstrations are marked in bold. GPT-4 benefits from normalized example values and reaches an F1-score of 86%, which is 12% better than the zero-shot configuration. In contrast to the extraction scenario, example values only marginally improve the extraction and normalization performance of GPT-3.5. However, adding five demonstrations enhances the performance of both GPT-3.5 and GPT-4, with GPT-4 achieving an F1-score of 91% and outperforming GPT-3.5 by 5%.

Table 4: F1-scores for the extraction with normalization scenario.

					10 Example Values &
LLM	0 val.	3 val.	5 val.	10 val.	3 dem.	5 dem.	10 dem.
GPT-3.5	68.86	69.75	69.91	69.36	85.45	86.04	84.49
GPT-4	74.19	83.77	84.90	85.60	91.18	91.32	91.31

We now analyze the normalization operations in detail. Table 5 shows the F1-scores for the best configuration with and without demonstrations per normalization operation. Previous research has demonstrated that GPT-3’s performance is weaker in tasks that require reasoning or calculations, as opposed to tasks that involve manipulating free text or names [4]. Our results support these observations, GPT-3.5 and GPT-4 achieve the best zero-shot result for the string wrangling operation. With ten demonstrations, GPT-3.5 and GPT-4 are particularly strong at name expansion and string wrangling. GPT-4 achieves F1-scores of 98% and 95%. Unit of measurement normalization requires calculations and is the most challenging operation for GPT-3.5 and GPT-4. Example values and demonstrations improve GPT-4’s zero-shot performance by 22%, leading to an F1-score of 84%. Compared to pure extraction, we observe that the generalization of attribute values simplifies the task for GPT-4, likely because it can utilize its background knowledge for the generalization while making fewer errors in choosing the correct attribute value boundaries. For generalization attributes like the ’Product Type’, we observe across the five product categories that GPT-4 benefits from the generalization on average by 7% if example values and demonstrations from the training set are provided.

Table 5: F1-scores by normalization operation.

GPT-3.5

GPT-4

Normalization Operation

0 val.

10 val.

10 dem.

0 val.

10 val.

10 dem.

Name Expansion

41.61

42.15

94.50

48.64

96.50

98.27

Generalization

75.27

76.63

82.16

76.48

79.82

88.56

Unit of Measurement Norm.

51.16

47.64

76.24

61.76

66.42

83.50

String Wrangling

87.07

83.78

93.37

92.41

93.60

95.19

6 Conclusion

In this paper, we investigate the ability of GPT-3.5 and GPT-4 to extract product attribute values from product offers and normalize the attribute values to enable downstream e-commerce applications like faceted product filtering. For the evaluation, we introduce the WDC PAVE dataset, featuring manually verified ground truth values for both tasks, extraction and extraction with normalization. We experiment with different prompt designs that utilize example values and demonstrations selected from a training set.GPT-4 achieves the best F1-score of 91% in the extraction scenario, surpassing the best PLM baseline by 10%, and shows similar performance for the extraction with normalization scenario. GPT-4 excels in wrangling strings and in expanding names, where it can utilize its background knowledge. Normalization tasks that require reasoning such as the generalization of attribute values or performing calculations, such as the unit of measurement normalization, are most challenging for both GPT-3.5 and GPT-4. To improve the unit of measurement normalization, a compelling avenue for future research is giving LLMs access to functions or a code interpreter for calculating the unit of measurement conversions.

References

[1] Agrawal, M., Hegselmann, S., Lang, H., et al.: Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 1998–2022 (2022)
[2] Brinkmann, A., Shraga, R., Bizer, C.: Product Attribute Value Extraction using Large Language Models. arXiv preprint arXiv:2310.12537 (2023)
[3] Ghani, R., Probst, K., Liu, Y., et al.: Text mining for product attribute extraction. ACM SIGKDD Explorations Newsletter 8(1), 41–48 (2006)
[4] Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., et al.: Can language models automate data wrangling? Machine Learning 112(6), 2053–2082 (2023)
[5] Jain, M., Bhattacharya, S., Jain, H., et al.: Learning cross-task attribute-attribute similarity for multi-task attribute-value extraction. In: Proceedings of The 4th Workshop on e-Commerce and NLP. pp. 79–87 (2021)
[6] Parekh, T., Hsu, I.H., Huang, K.H., et al.: Geneva: Benchmarking generalizability for event argument extraction with hundreds of event types and argument roles. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 3664–3686 (2023)
[7] Primpeli, A., Peeters, R., Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. In: Companion Proceedings of The 2019 World Wide Web Conference. pp. 381–386 (2019)
[8] Putthividhya, D., Hu, J.: Bootstrapped named entity recognition for product attribute extraction. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. pp. 1557–1567 (2011)
[9] Shyr, C., Hu, Y., Harris, P.A., et al.: Identifying and extracting rare disease phenotypes with large language models. arXiv preprint arXiv:2306.12656 (2023)
[10] Wang, Q., Yang, L., Kanagal, B., et al.: Learning to extract attribute value from product via question answering: A multi-task approach. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 47–55 (2020)
[11] Wang, X., Li, S., Ji, H.: Code4struct: Code generation for few-shot event structure prediction. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 3640–3663 (2023)
[12] Wei, J., Tay, Y., Bommasani, R., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
[13] Wong, Y.W., Widdows, D., Lokovic, T., et al.: Scalable attribute-value extraction from semi-structured text. In: 2009 IEEE International Conference on Data Mining Workshops. pp. 302–307. IEEE (2009)
[14] Xu, H., Wang, W., Mao, X., et al.: Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5214–5223 (2019)
[15] Yan, J., Zalmout, N., Liang, Y., et al.: AdaTag: Multi-attribute value extraction from product profiles with adaptive decoding. arXiv preprint arXiv:2106.02318 (2021)
[16] Yang, L., Wang, Q., Yu, Z., et al.: Mave: A product dataset for multi-source attribute value extraction. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. pp. 1256–1265 (2022)
[17] Zhang, X., Zhang, C., Li, X., et al.: Oa-mine: open-world attribute mining for e-commerce products with weak supervision. In: Proceedings of the ACM Web Conference 2022. pp. 3153–3161 (2022)
[18] Zhao, W., et al.: A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)
[19] Zheng, G., Mukherjee, S., Dong, X.L., et al.: Opentag: Open attribute value extraction from product profiles. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1049–1058 (2018)