11institutetext: University of Mannheim, Schloss, 68161 Mannheim, Germany
11email: {[email protected], [email protected], [email protected]}

Using LLMs for the Extraction and Normalization of Product Attribute Values

Alexander Brinkmann 11 0000-0002-9379-2048    Nick Baumann 11 0009-0001-1215-9153    Christian Bizer 11 0000-0003-2367-0237
Abstract

Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.

Keywords:
Information Extraction Product Attribute Value Extraction Value Normalization Large Language Models

1 Introduction

Product attribute value extraction (PAVE) identifies attribute values in product titles and descriptions. After normalizing the extracted attribute values to attribute-specific scales, they are used for tasks such as faceted product search or product comparison. Figure 1 shows an example of a product offer and attribute-value pairs that have been extracted from the product title. For each attribute, the extracted and the normalized value are displayed.

Refer to caption
Figure 1: Product offer with extracted and normalized attribute-value pairs.

Existing methods for PAVE often require large amounts of domain-specific training data to fine-tune pre-trained language models (PLM) that label attribute value sequences [9, 24, 29] or extract attribute values using question answering [21, 26]. The methods focus on identifying the sequences of tokens that form attribute values but do not cover the normalization of the extracted values. Motivated by the success of large language models (LLMs) in related NLP tasks [22] and other information extraction use cases [1, 3, 12], this paper explores the potential of LLMs for the following PAVE tasks: (i) direct extraction, (ii) extraction with normalization, and (iii) normalization. The objective of the direct extraction task is to extract sequences of tokens that form attribute values from product titles and descriptions [21, 23, 24, 26]. The goal of the extraction with normalization task is to extract and normalize attribute-value pairs in a single step. The normalization task aims at normalizing attribute values that were extracted by a separate, preceding extraction step [8]. Since current benchmarks for extracting product attribute values are designed for measuring extraction quality and do not cover the normalization of attribute values [23, 26, 28], we introduce a new benchmark dataset, Web Data Commons - Product Attribute Value Extraction (WDC-PAVE). Unlike existing benchmark datasets, which contain only data from a single source [23, 26, 28], WDC-PAVE consists of 565 product offers originating from 59 different websites that use the schema.org vocabulary. In contrast to related work [23, 26], the 4,687 attribute-value pairs in WDC-PAVE have been manually verified and are available in two formats: (i) extracted and (ii) normalized. To normalize the attribute-values, the systems need to perform name expansion, generalization, unit of measurement conversion, and string wrangling. In summary, this paper makes the following contributions:

  1. 1.

    We propose prompt templates for instructing LLMs to extract and normalize attribute-value pairs from product titles and descriptions. The templates cover use cases with and without training data. In contrast to existing work on attribute value normalization [8], the templates exploit the attribute value context for the normalization.

  2. 2.

    We introduce the WDC-PAVE benchmark consisting of 565 heterogeneous product offers and 4,687 manually verified attribute-value pairs. The benchmark supports three tasks: (i) direct extraction, (ii) extraction with normalization, and (iii) normalization. In contrast to existing benchmarks [26, 23, 28], WDC-PAVE covers value extraction and value normalization.

  3. 3.

    We experimentally compare the extraction performance of GPT-3.5 and GPT-4 to the PLM-based extraction methods SU-OpenTag [23], AVEQA [21], and MAVEQA [26] on WDC-PAVE. GPT-4 achieves the overall best results with an F1-score of 91% and outperforms the PLM baselines by 10%.

  4. 4.

    We experiment with extracting and normalizing attribute values in a single step using GPT-3.5 and GPT-4. Given 10 example attribute values and 5 demonstrations, GPT-4 again reaches and overall F1-score of 91%. The model performs particularly well for string wrangling and name expansion with F1-scores of 95% and 98% respectively.

  5. 5.

    We experiment with normalizing previously extracted attribute values using GPT-3.5 and GPT-4. Given 10 example attribute values and 5 demonstrations, GPT-4 reaches a F1-score of 96%, which is 5% higher than in the extraction and normalization scenario.

The paper is structured as follows: Section 2 introduces the benchmark dataset WDC-PAVE. Section 3 describes the experimental setup. Section 4, Section 5, and Section 6 discuss the experimental results for the scenarios: (i) direct extraction, (ii) extraction with normalization, and (iii) normalization of product attribute values. Related work in discussed in Section 7. The WDC-PAVE benchmark and the code for replicating the experiments are available online111https://github.com/wbsg-uni-mannheim/wdc-pave.

2 The WDC-PAVE Benchmark

This section introduces the WDC-PAVE benchmark. First, we describe the collection of product offers and attribute-value pairs using schema.org222https://schema.org/ annotations and product specification tables within web pages. Second, we present profiling statistics about the WDC-PAVE benchmark. Third, we introduce the normalization operations.

Data collection. The Web Data Commons (WDC)333https://webdatacommons.org/ project extracts structured data from the Common Crawl444https://commoncrawl.org/ and provides the extracted data for public download. The WDC Product Data Corpus (WDC LSPM)555https://webdatacommons.org/largescaleproductcorpus/v2/ [13] is one of the extracted datasets. It consists of over 26 million product offers originating from 79 thousand different websites which employ the schema.org vocabulary to annotate structured product data within their HTML pages. The offers are classified into 26 product categories. In addition to schema.org annotations, WDC LSPM extracts attribute-value pairs from specification tables found in the web pages. The attributes in these pairs are product category-specific, such as the number of processor cores of a computer. Category-specific attributes are not part of the schema.org vocabulary and therefore are not explicitly annotated in the web pages. We clean the product offers and attribute-value pairs in the WDC LSPM corpus, omitting those with missing titles, descriptions, or specification tables, and those with descriptions exceeding 1,000 characters. In addition, HTML and language tags are stripped away, and only product offers in English are kept. We select the five categories ’Computers and Accessories’, ’Jewelry’, ’Grocery and Gourmet Food’, ’Office Products’, and ’Home and Garden’ for WDC-PAVE because they contain a large number of product offers and attribute-value pairs after pre-processing. Subsequently, a random sample of product offers is drawn for each category, with the objective to manually verify their attribute-value pairs. Based on the sampled product offers a fixed set of attributes per category is determined. As the attribute-value pairs in the specification tables are heterogeneously annotated on the different websites, a human annotator is required to verify that each attribute value is a sub-string of the title or the description, and that it semantically fits the attribute. Additionally, the human annotator adds attribute values that are not contained in the specification tables but are mentioned in the product offer to the gold standard. If an attribute is not referenced in the title or the description, the value "n/a" is assigned to this attribute.

Dataset Statistics. Table 1 shows profiling statistics describing the WDC-PAVE benchmark dataset. The dataset consists of 4,687 attribute-value pairs from 565 product which originate from 59 different websites. Overall, 45% of the attribute-value pairs hold the attribute value "n/a" meaning the attribute is neither mentioned in the title nor in the description. The dataset contains 2,011 unique attribute values so that each attribute has on average 54 unique values.

Table 1: Statistics for WDC-PAVE
Home & Computers & Grocery & Office
Kategorie Garden Accessories Gourmet Produkte Jewelry Overall
Unique Attributes 8 11 5 10 3 37
Attribute-Value Pairs 1,136 1,914 160 1,180 297 4,687
Unique Values 493 576 136 658 148 2,011
Unique Norm. Values 305 343 92 388 116 1,244
Product Offers 142 174 32 118 99 565
Host Websites 16 10 6 10 17 59

Data Normalization. Each of the 37 attribute in the dataset requires to be normalized before being usable for applications such as faceted product search. We have identified four normalization operations. Table 2 illustrates the normalization operations with examples for selected attributes. Name Expansion deals with the expansion of abbreviated attribute values such as "HP" into their non-abbreviated form, e.g. "Hewlett-Packard". Generalization assigns attribute values to broader categories, e.g. the color "Neon Lime Green" to the more general category "Green". Unit of Measurement Normalization converts an attribute value to an attribute-specific target unit of measurement and format, such as the weight value "20-lb." to "9.06", which represents the weight in kilograms (kg). String Wrangling normalizes attribute values to a specific format by for example replacing words with numbers or removing non-alphanumeric characters, e.g. the value "CTW-4M(208)" would be normalized to "CTW4M208". Each attribute is assigned to one of the normalization operations. After normalizing the 2,011 unique values in WDC-PAVE, the dataset contains 1,244 unique normalized attribute values.

Table 2: Overview of attribute value normalization operations in WDC-PAVE
Operation Attributes Examples
Name
Expansion
Manufacturer,
Generation,
Capacity, Cache
"HP" \rightarrow "Hewlett-Packard"
"PII" \rightarrow "Pentium II"
"G1" \rightarrow "Generation 1"
Generalization
Product Type,
Color, Processor
Typ
"Oatmeal" \rightarrow "Snacks and Breakfast"
"Sparkling Juices" \rightarrow "Beverages"
"Neon Lime Green" \rightarrow "Green"
Unit of
Measurement
Normalization
Dimensions,
Paper Weight,
Size/Weight,
Rotational Speed,
Pack Quantity
"7"" \rightarrow "17.8"
"164 ft" \rightarrow "4998.7"
"20-lb." \rightarrow "9.06" (kg)
"0.31 oz" \rightarrow "879" (g)
"10k" \rightarrow "10000"
String
Wrangling
Identifiers, Ports
Processor Core,
Retail UPC,
Brand
"CTW-4M(208)" \rightarrow "CTW4M208"
"Dual Port" \rightarrow "2"
"4-Core" \rightarrow "4"
"Quaker Foods" \rightarrow "QUAKER FOODS"

3 Experimental Setup

For our experiments, we split the WDC-PAVE dataset into a training set with 211 product offers and 1,750 attribute-value pairs as well as a test set with 354 product offers and 2,937 attribute-value pairs, stratified by product category. We access GPT-3.5-turbo-16k-0613 referred to as GPT-3.5 and GPT-4-0613 referred to as GPT-4 through the OpenAI API666https://platform.openai.com/docs/api-reference. The temperature parameter of the LLMs is set to 0 to reduce the randomness. GPT-3.5 and GPT-4 are not fine-tuned. Instead, we select semantically similar demonstration product offers for in-context learning. As baselines for the experiments, we fine-tune the PLM-based extraction methods SU-OpenTag [23], AVEQA [21] and MAVEQA [26] on the training set. The fine-tuning is executed on a single NVIDIA RTX A6000 GPU. Since, the prompt templates for the LLMs and the PLM-based extraction methods utilize the same training set for demonstration selection and fine-tuning, respectively, we assume that this is a fair comparison. For the evaluation, we follow related work and calculate F1-scores based on the exact match between the predicted and the ground truth attribute values [3, 21, 23, 24, 26].

4 Direct Extraction

This section compares various prompt templates for extracting attribute-value pairs from the product offers in WDC-PAVE. Following Brinkmann et al. [3], the prompts ask the LLM to extract all attributes of the target schema in a single step. Figure 2 shows an example of a complete prompt.

Refer to caption
Figure 2: Example prompt. The parts set in back font are used for the extraction. The red parts are added for the extraction with normalization task).

Prompt Templates. Each prompt template consists of up to six building blocks visualized by different background colors. The building blocks are role description (blue), task description (blue), task input (green), and task output (orange), demonstration input (green) and demonstration output (yellow). The role description is a system message that defines the overall goal of the LLM as well as the target schema for the extraction including attribute descriptions and example values. The target schema is encoded using JSON-schema777https://json-schema.org/ as this representation proved to be most effective in related work [3]. The task description is a user message that provides instructions for the attribute-value extraction. The task input is a user message containing the product title and description. The task output contains the LLM’s response with the extracted attribute-value pairs in JSON format. The input and output of the demonstrations are user and assistant messages containing product offers that are semantically similar to the target product offer in the task input. To calculate the semantic similarity, the target product offer and the training demonstrations are embedded using OpenAI’s embedding model text-embedding-ada-002888https://platform.openai.com/docs/guides/embeddings/. The training demonstrations with the highest cosine similarity to the product offer are added into the prompt. Figure 2 contains instructions for extraction and normalization. The red text is added to the prompts for the extraction with normalization task and the normalization task, which are discussed in Section 5 and Section 6.

Discussion of Results. We assess the impact of zero-shot and few-shot prompt template configurations on GPT-3.5 and GPT-4 performance. Table 3 shows the F1-scores, the average number of tokens per prompt, and the cost in $ per 1,000 extracted attribute values of the prompt template configurations with different amounts of example values (Val.) and the combination of 10 example values and different amounts of demonstrations (Dem.). The highest F1-score with and without demonstrations is marked in bold. For the cost calculation, we use the OpenAI prices as of April 2024999https://openai.com/pricing. The results show that GPT-3.5 and GPT-4 benefit from ten example values, achieving F1-scores of around 80%. Demonstrations further improve the performance of GPT-3.5 and GPT-4. Ten demonstrations allow GPT-4 to achieve the highest F1-score of 91%. Both, example values and demonstrations implicitly guide the LLMs to extract exactly the surface form of the attribute values that is used in the offer and is expected by the ground truth. The usage of example values and demonstrations significantly increases the length and the cost of the prompts. Adding three demonstrations increases the cost of extracting 1,000 attribute values with GPT-4 by 1$. In order to enhance GPT-4’s performance by an additional 1.6%, an additional 2$ per 1,000 extracted attribute values must be spent.

Table 3: Results for the direct extraction task.
F1 Average length $ per 1K values
Val. Dem. GPT-3.5 GPT-4 GPT-3.5 GPT-4 GPT-3.5 GPT-4
0 0 70.66 74.40 750 745 0.09 0.10
3 0 77.11 78.77 985 973 0.12 1.25
5 0 78.51 77.52 1097 1114 0.13 1.40
10 0 80.54 79.65 1334 1338 0.15 1.66
10 3 86.91 88.94 2274 2205 0.26 2.61
10 5 86.93 88.15 2776 2735 0.32 3.21
10 10 88.02 90.54 3975 3974 0.43 4.60

Comparison of LLMs and PLMs. We now compare the performance of GPT-3.5 and GPT-4 with the prompt template that uses ten example values and ten demonstrations to the PLM baselines SU-OpenTag [23], AVEQA [21] and MAVEQA [26]. The baselines are fine-tuned on the same training set used to select example values and task demonstrations for the LLM prompt templates. The results in Table 4 show that GPT-4 outperforms the best PLM baseline AVEQA by 10% F1. It is important to mention that the training set with 1,750 attribute-value pairs is small. The training sets of the existing benchmark datasets MAVE [26] and AE-110k [23] contain 3.8 million and 84.7 thousand attribute-value pairs. As shown by Brinkmann et al. [3], it can be expected that with additional training data the performance of the PLM baselines increases whereas the performance of the LLMs only marginally improves.

Table 4: Comparison of LLM- and PLM-based methods.
GPT-4 GPT-3.5 AVEQA MAVEQA SU-OpenTag
F1 90.54 88.02 80.83 65.10 60.44
Δ1subscriptΔ1\Delta_{1}roman_Δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT to GPT-4 - -2.52 -9.71 -25.44 -30.10

5 Extraction with Normalization

This section evaluates prompt templates instructing LLMs to extract and normalize attribute values in a single step. To instruct the LLM on how to normalize attribute values, the prompt template in Figure 2 is extended with the texts set in red font. The extensions add normalization guidelines to the task description and mappings between example values and their normalized forms to the target schema. The attribute values in the demonstrations are normalized.

Discussion of Results. As in Section 4, we measure how adding zero, three, five and ten example values without demonstrations and ten example values with three, five and ten demonstrations to the prompt affects the performance of GPT-3.5 and GPT-4. Table 5 shows the results of the experiments. The best F1-scores per model are marked in bold. Without demonstrations, GPT-4 benefits from normalized example values and reaches an F1-score of 86%, which is 12% better than the zero-shot configuration. In contrast to the extraction task, example values only marginally improve the extraction and normalization performance of GPT-3.5. However, adding five demonstrations improves the performance of both LLMs, with GPT-4 achieving an F1-score of 91% and outperforming GPT-3.5 by 5%. The prompts for the extraction with normalization task are longer than the prompts for the extraction task because of the additional normalization instructions. At the same time, the best performance is achieved with 5 instead of 10 demonstrations, resulting in a cost reduction of 1$ per 1,000 extracted attribute values for GPT-4 (see rightmost column in Table 5).

Table 5: Results for the extraction with normalization task.
F1-score Avg. Tok. per Prompt $ per 1k Attr. Val.
Val. Dem. GPT-3.5 GPT-4 GPT-3.5 GPT-4 GPT-3.5 GPT-4
0 0 68.62 74.19 902 889 0.1043 1.0820
3 0 71.32 83.77 1156 1155 0.1327 1.3774
5 0 70.17 84.90 1316 1312 0.1505 1.5531
10 0 68.82 85.60 1715 1673 0.1954 1.9578
10 3 85.68 91.18 2708 2702 0.3057 3.1028
10 5 86.37 91.32 3040 3079 0.3428 3.5247
10 10 86.30 91.31 4024 4031 0.4529 4.5901

Analysis of Normalization Operations. We now analyze the performance of the LLMs on the normalization operations in detail. Table 8 shows the F1-scores per normalization operation for the prompt templates zero-shot , with ten example values and with ten example values and five demonstrations. Previous research has shown that GPT-3.5 performance is weaker on tasks requiring reasoning or calculation than on tasks involving manipulation of free text or names [8]. Our results support these observations. GPT-3.5 and GPT-4 are particularly strong at name expansion and string wrangling. GPT-4 achieves F1-scores of 98% and 97%. Unit of measurement conversion requires calculations and is the most challenging operation for GPT-3.5 and GPT-4. Example values and demonstrations improve GPT-4’s zero-shot performance by 22%, leading to an F1-score of 83.5%. Compared to the extraction task, we observe that generalizing attribute values simplifies the task for GPT-4, possibly because it can use its background knowledge to generalize the values. For attributes like ’Product Type’, we observe that GPT-4 benefits from the generalization by an average of 7% across all product categories if example values and demonstrations from a training set are provided.

Table 6: F1-scores by normalization operation for the extraction with normalization task.
GPT-3.5 GPT-4
Normalization Operation
0 Val. 10 Val.
10 Val.
5 Dem.
0 Val. 10 Val.
10 Val.
5 Dem.
Name Expansion 41.61 42.15 94.50 48.64 93.60 98.27
Generalization 75.27 76.63 82.16 76.48 79.82 88.56
Unit of Measurement Norm. 51.16 47.64 76.24 61.76 73.86 83.50
String Wrangling 87.07 83.78 93.37 92.41 97.37 95.19

6 Normalization

This section compares different prompt templates instructing LLMs to normalize attribute values which have been extracted by a separate preceding extraction step. As in Section 5, the prompt template in Figure 2 is extended by the text set in red font. The attribute values to be normalized are added to the task input block in addition to the product title and the description, which can be exploited as context in the normalization process.

Discussion of Results. Like in Section 4 and Section 5, we assess how zero, three, five and ten example values without demonstrations and ten example values with three, five and ten demonstrations selected from the training set impact the performance of GPT-3.5 and GPT-4. Table 7 shows the results of the experiments. The highest F1-scores are marked in bold. Similar as in the previous tasks, we observe that both example values and demonstrations improve the F1-scores of GPT-3.5 and GPT-4. The effect of adding demonstrations for GPT-3.5 is marginal while GPT-4 gains 4% F1-score.

Table 7: Results for the normalization task.
F1 Average length $ per 1K values
Val. Dem. GPT-3.5 GPT-4 GPT-3.5 GPT-4 GPT-3.5 GPT-4
0 0 82.81 86.41 974 974 0.1120 1.1813
3 0 89.99 92.00 1242 1260 0.1420 1.5015
5 0 91.16 92.11 1378 1376 0.1573 1.6318
10 0 90.61 92.44 1732 1732 0.1969 2.0292
10 3 91.68 95.76 2941 2961 0.3321 3.3931
10 5 90.98 96.06 3486 3548 0.3931 4.0495
10 10 90.94 96.21 4795 4811 0.5395 5.4624

Analysis of Normalization Operations. We now analyze the normalization operations in detail. Table 8 shows the F1-scores per normalization operation for the prompt templates zero-shot, with ten example values and with ten example values and five demonstrations. The results show that string wrangling can be well handled by LLMs if the attribute values have already been extracted. The other normalization operations require example values and demonstrations to reach high F1-scores. Compared to the extraction with normalization task, the unit of measurement conversion results for GPT-3.5 and GPT-4 improve by 17% and 14% if the values have already been extracted. This shows how challenging the combination of extraction and normalization for the LLMs is given that unit of measurement conversions need to be performed. In contrast, the generalization operation remains as challenging as in the extraction with normalization task.

Table 8: F1-scores by normalization operation for the normalization task.
GPT-3.5 GPT-4
Normalization Operation
0 Val. 10 Val.
10 Val.
5 Dem.
0 Val. 10 Val.
10 Val.
5 Dem.
Name Expansion 64.47 95.18 97.15 84.10 100.00 99.12
Generalization 73.58 80.67 79.81 78.19 82.04 90.72
Unit of Measurement Norm. 84.89 93.28 93.15 90.28 94.07 97.89
String Wrangling 97.30 96.71 97.70 93.68 98.81 99.32

Comparison across Tasks. We now analyze how the F1-scores change between (i) direct extraction, (ii) extraction with normalization and (iii) normalization. Table 9 shows the F1-scores of GPT-3.5 and GPT-4 for the zero-shot scenario and the few-shot scenario with ten example values and five demonstrations. In addition, the deltas between the extraction task and the extraction with normalization task (Δ1subscriptΔ1\Delta_{1}roman_Δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT) and the delta between the extraction task and the normalization task (Δ2subscriptΔ2\Delta_{2}roman_Δ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT) are shown. The results indicate that extraction is more challenging for LLMs than normalization. Zero-shot both GPT-3.5 and GPT-4 achieve 12% higher F1-scores if no extraction is required. With example values and demonstrations, GPT-3.5 and GPT-4 reach F1-scores that are 4% and 8% higher if no extraction is required.

Table 9: Comparison of F1-scores across tasks.
E     xtract &
Val. Dem. Model Extract Normalize Δ1subscriptΔ1\Delta_{1}roman_Δ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT(Extr.) Normalize Δ2subscriptΔ2\Delta_{2}roman_Δ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT(Extr.)
0 0 GPT-3.5 70,66 68,62 -2,04 82,81 +12,15
0 0 GPT-4 74,40 74,19 -0,21 86,41 +12,01
10 5 GPT-3.5 86,93 86,37 -0,56 90,98 +4,05
10 5 GPT-4 88,15 91,32 3,17 96,06 +7,91

7 Related Work

Product Attribute Value Extraction. Early research on PAVE used domain-specific rules to extract attribute-value pairs [20, 27, 11] from product descriptions. The initial learning-based methods required extensive feature engineering and did not generalize to unknown attributes and values [6, 14]. Recent works have adopted BiLSTM-CRF architectures [10, 29] to tag attribute values in product titles. SU-OpenTag [23] builds upon OpenTag [29] by encoding both a target attribute and the product title using the PLM BERT [4]. [17, 18, 21, 26] approach PAVE as a question-answering task, using different PLMs to encode target attribute, category, and title. The PLM-based methods SU-OpenTag [23], AVEQA [21], and MAVEQA [26] serve as baselines for the direct extraction task.

LLMs for Attribute Value Extraction. LLMs have successfully been used for information extraction in various domains [1, 7, 12]. In the context of PAVE recent works experiment with different prompt designs for PAVE using LLMs [3, 5]. We use the prompt templates from [3] as a role model for our prompt templates. In contrast to related work  [2, 16, 25], we do not fine-tune the LLMs, but instead rely on in-context learning via demonstrations and example values. Early work on attribute value extraction and normalization applied domain-specific normalization rules [15, 19]. Jaimovitch-López et al. [8] experiment with using GPT-3.5 for attribute value normalization but present the values to be normalized without any context to the model. In contrast, we include the original titles and descriptions into the prompts in order to provide context for the normalization.

Benchmarks for Attribute Value Extraction and Normalization. The benchmarks MAVE [26] and AE-110k [23] are widely used to evaluate methods for PAVE. MAVE relies on an ensemble of five fine-tuned PLMs for determining ground truth annotates. AE-110k [23] uses values from product specification tables as ground truth. In contrast to these benchmarks, all attribute-value annotations in WDC-PAVE are manually verified. To our knowledge, OA-Mine [28] is the only other publicly available benchmark offering human-verified attribute-value annotations. MAVE, AE-110k, and OA-Mine address value extraction and do not consider value normalization. WDC-PAVE covers both tasks. [8] propose an attribute value normalization benchmark including operations such as transforming dates, units of measurement, or names. Unlike WDC-PAVE, their benchmark presents the values to be normalized without any context that can be exploited by the methods.

8 Conclusion

This paper investigated the ability of GPT-3.5 and GPT-4 to extract and normalize product attribute values from product offers. We experimented with different prompt templates that use example values and demonstrations for in-context learning. We introduced the WDC-PAVE benchmark, which features manually verified ground truth values for attribute value extraction as well as value normalization. GPT-4 achieves the best F1-score of 91% in the extraction task, surpassing the best PLM baseline by 10%, and shows similar performance for the extraction with normalization task. A compelling avenue for future research is to give LLMs access to scale-specific functions that the model can decide to invoke for normalizing values.

References

  • [1] Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., Sontag, D.: Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 1998–2022 (2022)
  • [2] Blume, A., Zalmout, N., Ji, H., Li, X.: Generative Models for Product Attribute Extraction. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track. pp. 575–585 (2023)
  • [3] Brinkmann, A., Shraga, R., Bizer, C.: Product Attribute Value Extraction using Large Language Models. arXiv preprint arXiv:2310.12537 (2023)
  • [4] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 4171–4186 (2019)
  • [5] Fang, C., Li, X., Fan, Z., Xu, J., Nag, K., et al.: LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction (2024), arXiv:2403.00863 [cs]
  • [6] Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.: Text mining for product attribute extraction. ACM SIGKDD Explorations Newsletter 8(1), 41–48 (2006)
  • [7] Goel, A., Gueta, A., Gilon, O., Liu, C., Erell, S., et al.: LLMs Accelerate Annotation for Medical Information Extraction. In: Proceedings of the 3rd Machine Learning for Health Symposium. pp. 82–100 (2023)
  • [8] Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J.: Can language models automate data wrangling? Mach Learn 112(6), 2053–2082 (2023)
  • [9] Jain, M., Bhattacharya, S., Jain, H., Shaik, K., Chelliah, M.: Learning cross-task attribute-attribute similarity for multi-task attribute-value extraction. In: Proceedings of the 4th Workshop on e-Commerce and NLP. pp. 79–87 (2021)
  • [10] Kozareva, Z., Li, Q., Zhai, K., Guo, W.: Recognizing salient entities in shopping queries. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 107–111 (2016)
  • [11] Nederstigt, L.J., Aanen, S.S., Vandic, D., Frasincar, F.: FLOPPIES: A Framework for Large-Scale Ontology Population of Product Information from Tabular Data in E-commerce Stores. Decision Support Systems 59, 296–311 (2014)
  • [12] Parekh, T., Hsu, I.H., Huang, K.H., Chang, K.W., Peng, N.: Geneva: Benchmarking generalizability for event argument extraction with hundreds of event types and argument roles. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. pp. 3664–3686 (2023)
  • [13] Primpeli, A., Peeters, R., Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. In: Companion Proceedings of The 2019 World Wide Web Conference. pp. 381–386 (2019)
  • [14] Putthividhya, D., Hu, J.: Bootstrapped named entity recognition for product attribute extraction. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. pp. 1557–1567 (2011)
  • [15] van Rooij, G., Sewnarain, R., Skogholt, M., van der Zaan, T., Frasincar, F., et al.: A Data Type-Driven Property Alignment Framework for Product Duplicate Detection on the Web. In: Proceedings of 17th International Web Information Systems Engineering Conference. pp. 380–395 (2016)
  • [16] Roy, K., Goyal, P., Pandey, M.: Exploring generative frameworks for product attribute value extraction. Expert Systems with Applications 243, 122850 (2024)
  • [17] Sabeh, K., Kacimi, M., Gamper, J.: CAVE: Correcting Attribute Values in E-commerce Profiles. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. pp. 4965–4969 (2022)
  • [18] Shinzato, K., Yoshinaga, N., Xia, Y., Chen, W.T.: Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product Attribute Extraction. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. pp. 227–234 (2022)
  • [19] Valstar, N., Frasincar, F., Brauwers, G.: APFA: Automated product feature alignment for duplicate detection. Expert Systems with Applications 174, 114759 (2021)
  • [20] Vandic, D., Van Dam, J.W., Frasincar, F.: Faceted product search powered by the semantic web. Decision Support Systems 53(3), 425–437 (2012)
  • [21] Wang, Q., Yang, L., Kanagal, B., Sanghai, S., Sivakumar, D., et al.: Learning to extract attribute value from product via question answering: A multi-task approach. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 47–55 (2020)
  • [22] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
  • [23] Xu, H., Wang, W., Mao, X., Lan, M.: Scaling up open tagging from tens to thousands: Comprehension empowered attribute value extraction from product title. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5214–5223 (2019)
  • [24] Yan, J., Zalmout, N., Liang, Y., Grant, C., Ren, X., et al.: AdaTag: Multi-Attribute Value Extraction from Product Profiles with Adaptive Decoding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. pp. 4694–4705 (2021)
  • [25] Yang, L., Wang, Q., Wang, J., Quan, X., Feng, F., et al.: MixPAVE: Mix-Prompt Tuning for Few-shot Product Attribute Value Extraction. In: Findings of the Association for Computational Linguistics: ACL 2023. pp. 9978–9991 (2023)
  • [26] Yang, L., Wang, Q., Yu, Z., Kulkarni, A., Sanghai, S., et al.: Mave: A product dataset for multi-source attribute value extraction. In: Proceedings of the 15th ACM International Conference on Web Search and Data Mining. pp. 1256–1265 (2022)
  • [27] Zhang, L., Zhu, M., Huang, W.: A Framework for an Ontology-based E-commerce Product Information Retrieval System. J. Comput. 4(6), 436–443 (2009)
  • [28] Zhang, X., Zhang, C., Li, X., Dong, X.L., Shang, J., et al.: OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision. In: Proceedings of the ACM Web Conference 2022. pp. 3153–3161 (2022)
  • [29] Zheng, G., Mukherjee, S., Dong, X.L., Li, F.: OpenTag: Open Attribute Value Extraction from Product Profiles. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1049–1058 (2018)