NutriBench: A Dataset for Evaluating Large Language Models in Carbohydrate Estimation from Meal Descriptions

Andong Hua1   Mehak Preet Dhaliwal1  Ryan Burke   Yao Qin1
1 University of California, Santa Barbara
{dongx1997,mdhaliwal}@ucsb.edu, [email protected], [email protected]
Equal contribution, alphabetically ordered.
Abstract

Accurate nutrition estimation helps people make informed decisions about their dietary choices and is crucial for preventing serious health issues. We present NutriBench, the first publicly available nutrition benchmark based on natural language meal descriptions. NutriBench consists of 5,000 human-verified meal descriptions with macro-nutrient labels, including carbohydrates, proteins, fats, and calories. The data is divided into 15 subsets varying in complexity based on the number, servings, and popularity of the food items in the meals and the specificity of serving size descriptions. We conducted an extensive evaluation of seven popular and state-of-the-art Large Language Models (LLMs), including GPT-3.5, Llama-3, and a medical domain-specific model, with standard, Chain-of-Thought, and Retrieval-Augmented Generation strategies on NutriBench for carbohydrate estimation. We also conducted a human study involving expert and non-expert participants and found that LLMs can provide more accurate and faster predictions than the human participants over a range of complex queries. We present a thorough analysis and comparison of different LLMs, highlighting the opportunities and challenges of using LLMs for nutrition estimation in real-life scenarios. Our benchmark is publicly available at: https://mehak126.github.io/nutribench.html

1 Introduction

Effective nutrition monitoring and dietary management are essential components of healthcare, closely linked to the prevention and control of chronic diseases, including obesity, heart disease, diabetes, and certain cancers [8, 27, 29]. For example, it is critical for patients with diabetes to estimate carbohydrates in meals to determine required insulin doses [13, 9]. Inaccurate self-estimation of meal carbohydrates can lead to high blood sugar (hyperglycemia) or low blood sugar (hypoglycemia) events, which can cause severe short and long-term health issues [12, 31].

Despite technological advancements in dietary assessment approaches, self-reported nutrition estimation suffers from limited accuracy and high user burden [29, 34, 37]. A major limitation is that most modern nutrition datasets include tabular data [2, 4, 23, 6, 5, 1] or meal images paired with nutrition information [13, 30, 24, 32, 36, 28, 26] but often lack natural language descriptions, restricting their usage and flexibility. For example, tabular database searches typically require an exact match for successful retrieval and require multiple searches for meals with multiple food items, making the process time-consuming and burdensome [18]. In addition, image-based nutrition estimation systems are restricted to real-time predictions, pose privacy concerns [18], and may encounter issues with food components being obscured in the image.

In contrast, describing meals using natural everyday language offers more flexibility, allowing users to explain various meal components and serving amounts in detail. We propose that Large Language Models (LLMs) are a valuable tool for nutrition estimation from natural language meal descriptions due to their advanced language understanding and reasoning capabilities, combined with vast internal knowledge and the ability to refer to external sources to provide precise nutrition estimates. However, no existing datasets are available to evaluate LLMs on this task.

Figure 1: GPT-3.5 answers a query from NutriBench using different prompting strategies.

To this end, we present NutriBench, a dataset consisting of 5,000 natural language meal descriptions with macronutrient and calorie annotations. To our knowledge, this is the first publicly available benchmark for evaluating the performance of LLMs on nutrition estimation from meal descriptions. We construct NutriBench with 15 subsets varying in complexity based on the number, servings, and popularity of food items in the meal. This allows us to evaluate LLMs across various challenging real-world scenarios. Figure 3 shows the pipeline of the construction process of NutriBench.

Figure 2: Accuracy versus Answer Rate for the evaluated LLMs across different prompting strategies on NutriBench.

With our human-verified NutriBench, we evaluate seven popular and state-of-the-art LLMs including GPT-3.5 [7], Llama-3 [3], and a domain-specific medical model, MedAlpaca [17], across four prompting paradigms including Chain-of-Thought (CoT) [39] and Retrieval Augmented Generation (RAG) [21] for carbohydrate estimation. An example of GPT-3.5’s output using different prompting strategies is shown in Figure 1. In addition, we summarize LLMs’ performance in Figure 2, where GPT-3.5 with CoT prompting achieves the highest accuracy of 51.48%, with an answer rate of 89.80%. Surprisingly, most LLMs outperformed even a professional nutritionist and non-experts based on our human study while also providing much faster estimates and accommodating a wide range of complex queries. This demonstrates the great potential of LLMs for nutrition estimation from meal descriptions, providing humans with dietary guidance and improving health outcomes.

Our contributions can be summarized as follows:

  • NutriBench: We present NutriBench - the first publicly available natural language meal description dataset labeled with macronutrient and calorie estimates. NutriBench consists of 5,000 meal descriptions with 15 subsets varying in complexity.

  • Benchmarking LLMs: We conducted 300 experiments across seven LLMs varying in size and expertise with different prompting strategies including CoT and RAG. This provides a comprehensive insight into the current capabilities of LLMs in nutrition estimation.

  • Insightful Analysis: We compare and analyze the relative performance across the different NutriBench subsets, LLMs, and prompting strategies. Additionally, we conduct a human study and find that LLMs can even outperform a professional nutritionist in carbohydrate estimation.

Figure 3: Overview of the construction process of NutriBench.

2 Related Work

Nutrition Estimation

Most mainstream nutrition datasets such as FoodData Central (FDC) [2], MenuStat [6], FoodCom [23], and Nutritionix [4] feature tabular food nutrition information curated from diverse sources. However, retrieval methods based on tabular datasets require users to use precise terminology to ensure successful and relevant retrieval. In addition, these methods do not support retrieving multiple items that may be present in a meal in a single search, making the process time-consuming and burdensome [18].

Another popular method for nutrition estimation involves predicting nutritional values from food images. Such approaches may involve identifying food items followed by retrieval from a tabular nutrition database [41] or directly estimating the meal’s nutritional breakdown from the image [19]. Many existing datasets contain images paired with nutrition information to facilitate research in this direction. These datasets may include images obtained from the web [30], real-world images [24, 32, 36], or synthetically generated images [28]. UMDFood90k [26] provides a multimodal dataset with product images, text-based ingredient statements, and nutrient amounts. However, image-based nutrition approaches are time-sensitive [18], requiring users to capture specific pictures of their meals at the time of consumption, and may encounter issues with food components being obscured in the image.

Enabling users to input meal descriptions in natural everyday language can help mitigate these issues. Initial exploration in this realm includes [20, 34], who utilize Convolutional Neural Networks to match food items from crowdsourced meal descriptions with an external tabular nutrition database. However, they do not make their data public, preventing the evaluation of current state-of-the-art language processing approaches, such as LLMs, for this task.

Motivated by the lack of a standardized benchmark for nutrition estimation based on natural language meal descriptions, we introduce NutriBench, the first publicly available nutrition benchmark of this kind, and evaluate state-of-the-art LLMs to gain insight into their current capabilities and limitations.

Large Language Models (LLMs)

Large Language Models (LLMs) have made significant progress in recent years, with both closed-source models like GPT-3/4 [7] and Gemini [35] and open-source models like Llama 2 [38], Llama 3 [3], and Alpaca [33] enabling advancements in natural language processing as well as knowledge-intensive, reasoning, and cross-domain tasks including healthcare applications [16]. Despite their extensive internal knowledge, LLMs still suffer from issues such as hallucinations, incorrect or outdated information, and a lack of interpretability [25, 22, 42].

Chain-of-Thought (CoT) [39] prompting alleviates some of these issues by enabling models to reason about the answer step-by-step. Another promising solution is Retrieval-Augmented Generation (RAG) [21], which provides the model with additional context relevant to the query by retrieving information from an external, reliable knowledge source. While previous works have identified shortcomings of both these approaches [10, 40], they have not been assessed on the task of nutrition estimation, which may uncover unique challenges.

Our work comprehensively evaluates 7 state-of-the-art LLMs with standard, CoT, RAG, and RAG combined with CoT prompting for carbohydrate estimation from natural language meal descriptions. We present a detailed analysis of how different prompting strategies affect LLMs’ performance in nutrition estimation in Section 5.

3 Dataset Construction

In this section, we describe our process of constructing our natural language meal description benchmark, NutriBench. Figure 3 provides an overview of the different stages in our pipeline.

3.1 Data Curation

The FoodData Central (FDC) [2] is the food composition information center of the US Department of Agriculture (USDA) [14]. It is a semi-annually updated data repository and we obtained the most recent version of the data at the time of writing (April 2024) to compile food items and their nutrient labels in our dataset. We retained entries that provided complete macro-nutrient information, including carbohydrates, proteins, fats, and calories. Overall, we retained a total of 1,788,981 entries from the database containing all macro-nutrients. Among these, 465,059 items had unique food names, and 1,286,036 items were associated with a specific brand.

3.2 Data Cleaning and Filtering

We conducted two rounds of outlier-based filtering to clean the data, with a primary focus on carbohydrates. In the first round, we eliminated variability among entries with the same food name and brand. Since we expect these entries to have nearly identical nutrition estimates, we applied a strict z-score-based filtering approach, removing estimates with a z-score greater than 1, and selected the median of the remaining estimates as the final carbohydrate label. For same-brand items with only two entries, we kept their mean as the final label if their relative difference was less than 0.1. The second round filtered out extreme outliers for the same food name across different brands. In this step, we allowed for more variability and increased the z-score and relative difference thresholds to 2 and 0.3, respectively. After both rounds, we were left with 756,594 entries, including 451,555 unique food names.
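The first filtering round can be sketched as follows. This is a minimal illustration rather than the pipeline used to build the dataset: the function name `consolidate_label`, the use of the absolute z-score with a population standard deviation, the definition of relative difference against the larger value, and the handling of groups with a single entry are all assumptions.

```python
from statistics import mean, median, pstdev

def consolidate_label(values, z_max=1.0, rel_diff_max=0.1):
    """Return one carbohydrate label for a group of duplicate entries.

    Returns None when the group is rejected. Thresholds default to the
    first-round values (z-score 1, relative difference 0.1).
    """
    if len(values) == 1:
        return values[0]  # assumption: a lone entry is kept as-is
    if len(values) == 2:
        a, b = values
        denom = max(abs(a), abs(b))
        # Assumption: relative difference is measured against the larger value.
        rel = 0.0 if denom == 0 else abs(a - b) / denom
        return mean(values) if rel < rel_diff_max else None
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return values[0]  # all estimates agree exactly
    # Keep estimates whose absolute z-score is within the threshold.
    kept = [v for v in values if abs(v - mu) / sigma <= z_max]
    return median(kept) if kept else None

# Three consistent estimates and one outlier (hypothetical values, in grams).
label = consolidate_label([20.0, 21.0, 20.5, 60.0])
```

Calling the same function with `z_max=2.0` and `rel_diff_max=0.3` would correspond to the looser thresholds of the second round.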

3.3 Extracting Natural Serving Descriptions

The standard nutrition measurement in the FDC database is based on a normalized quantity of 100g, denoted as “Metric Serving”. However, typical meal descriptions in natural language rarely include such precise measurements. For instance, it is more common to refer to “1 cup of rice” rather than “100g of rice” in day-to-day life. We extracted such natural servings from the FDC database, denoted as “Natural Serving” in this paper. In summary, our cleaned FDC database has 756,594 food items with varying brands and sources. Among them, 13,048 food items have additional natural serving descriptions with their corresponding weight in grams.

Table 1: Dataset sizes of the cleaned FDC, Retri-DB, and NutriBench.
        FDC       Retri-DB   NutriBench
Size    756,594   451,058    5,000

Next, we use the cleaned and filtered FDC database to create our retrieval database Retri-DB and our evaluation benchmark NutriBench. The data sizes are in Table 1.

3.4 Retrieval Database Construction

For our experiments with Retrieval-Augmented Generation, we construct a retrieval database, Retri-DB, to provide relevant external nutrition knowledge. For each food item, we compile the nutrition information available across different brands and for different servings into a comprehensive list of facts. If multiple brands provide estimates for a specific serving of a food item, we use the median as the reference value. Since models process unstructured text information more efficiently than tabular data [11], we use a rule-based transformation to convert the context to natural language. Specifically, for an entry containing a $serving\ amount$ and a $nutrition\ amount$ for a particular $nutrient$, we convert it to the string "$serving\ amount$ has $nutrition\ amount$ $nutrient$". Figure 1 shows an example of retrieved context for the two food items in the input query.
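The rule-based transformation described above can be illustrated with a short sketch. The function name `build_facts`, the tuple layout of the entries, the gram unit on the nutrition amount, and the example values are hypothetical; only the median across brands and the "serving amount has nutrition amount nutrient" template come from the description.

```python
from statistics import median

def build_facts(entries):
    """Turn tabular entries into natural language facts.

    entries: list of (serving_amount, nutrient, amount_in_grams) tuples,
    possibly with several brands reporting the same serving and nutrient.
    """
    grouped = {}
    for serving, nutrient, amount in entries:
        grouped.setdefault((serving, nutrient), []).append(amount)
    facts = []
    for (serving, nutrient), amounts in grouped.items():
        value = median(amounts)  # median across brands as the reference value
        facts.append(f"{serving} has {value:g}g {nutrient}")
    return facts

# Three brands reporting carbohydrates for the same serving (hypothetical values).
facts = build_facts([
    ("100g of white rice", "carbohydrates", 28.0),
    ("100g of white rice", "carbohydrates", 29.0),
    ("100g of white rice", "carbohydrates", 30.0),
])
```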

3.5 NutriBench Construction

To construct NutriBench, we generate 5,000 natural language meal descriptions from the cleaned FDC nutrition dataset, divided into 15 subsets. These subsets vary in the number of food items, serving sizes, and natural vs. metric serving descriptions and include both directly and indirectly retrievable food items to enable a comprehensive assessment of LLMs’ capabilities on this task. We also include both common and uncommon food items in NutriBench. The overall construction pipeline for NutriBench is depicted in Figure 3, with a summary of the subset sizes presented in Table 2. More detailed examples are available in the Appendix.

Increasing Number of Food Items

To evaluate the models’ capacity to handle multiple food items within a single meal description, we constructed three subsets containing one, two, or three distinct food items per meal description. For the double and triple food item subsets, we randomly sample two or three items from the single food item subset to create the combinations.

Increasing Number of Servings

We also vary the number of servings to evaluate whether LLMs exhibit mathematical and logical reasoning capabilities on this task. A ‘single’ natural serving represents ‘1’ unit of a food item, e.g., 1 apple, whereas a ‘single’ metric serving represents ‘100g’, e.g., 100g of apple. We convert these standardized single serving amounts by scaling the measures and the corresponding nutrition content by values sampled from a range of 0.25-48 for natural servings and 10-120 for metric servings. We refer to these as ‘Multiple’ servings. For multi-item meal descriptions, we first scale the serving size of each item and then combine them, e.g., “I ate two apples and half a toast for breakfast.”
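The scaling step can be made concrete with a small worked example. The helper names `scale_item` and `meal_carbs`, the dictionary layout, and the per-unit carbohydrate values are hypothetical and chosen only for illustration; the point is that the serving quantity and the nutrition label are multiplied by the same factor before items are combined.

```python
def scale_item(item, factor):
    """Scale a single-serving item: quantity and carbohydrates share one factor."""
    return {
        **item,
        "quantity": item["quantity"] * factor,
        "carbs_g": item["carbs_g"] * factor,
    }

def meal_carbs(items):
    """Ground-truth carbohydrates of a multi-item meal (sum over scaled items)."""
    return sum(item["carbs_g"] for item in items)

# Hypothetical single natural servings.
apple = {"name": "apple", "quantity": 1, "carbs_g": 25.0}
toast = {"name": "toast", "quantity": 1, "carbs_g": 14.0}

# "I ate two apples and half a toast for breakfast."
meal = [scale_item(apple, 2), scale_item(toast, 0.5)]
total = meal_carbs(meal)
```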

Natural vs. Metric Serving Descriptions

In the real world, people may choose to describe food items using either natural language (e.g., “a cup of latte”) or precise metric measurements (e.g., “100g of latte”). To reflect this diversity, we create distinct subsets containing food descriptions using natural or metric servings in a ratio of 7:3, favoring natural serving descriptions due to their higher frequency of day-to-day use. In Section 5.2, we explore how these distinct serving descriptions can significantly influence the performance of LLMs and retrieval-based methods.

Direct vs. Indirect Retrieval Subsets
Table 2: NutriBench comprises 15 distinct evaluation subsets, each varying in the number of food items, serving sizes, natural and metric serving descriptions, and including both directly and indirectly retrievable food items from Retri-DB.
Number of Food Items                   Single   Single     Double   Double     Triple
Number of Servings                     Single   Multiple   Single   Multiple   Single
Natural Serving, Direct Retrieval      500      500        500      500        500
Natural Serving, Indirect Retrieval    200      200        200      200        200
Metric Serving, Indirect Retrieval     300      300        300      300        300

To evaluate the performance of RAG based methods, we use Retri-DB introduced in Section 3.4 to provide relevant external nutrition information to the models. We divide NutriBench into two equal parts: one containing ‘Direct Retrieval’ food items, where the food items can be directly retrieved from Retri-DB with exact food name matches but different serving descriptions, and the other with ‘Indirect Retrieval’ food items, where no direct match exists between the queried food item and those in Retri-DB. The Direct Retrieval subset is used to assess the model’s ability to convert metric servings retrieved from the database (e.g., “100g of rice”) to natural servings used in meal descriptions (e.g., “1 cup of rice”). In contrast, the Indirect Retrieval subset evaluates the model’s capability to retrieve similar food items from the Retri-DB and use them as nutrition context for knowledge-grounded carbohydrate estimation.

3.5.1 Commonness-based Sampling

To ensure NutriBench includes both common and uncommon foods, we propose commonness-based sampling for constructing the Indirect Retrieval subsets (commonness-based sampling is not necessary for the Direct Retrieval subsets, since their food names can be directly retrieved from Retri-DB). Specifically, we quantify the commonness score using the embedding similarity of food names: the higher the similarity to other food items, the more common the food. We employ OpenAI’s “text-embedding-3-large” model to extract food name embeddings and compute a similarity matrix against all other food items in our database. The second-highest similarity score (the highest being 1 for the item itself) is used as the commonness score. We set a threshold of 0.75 to distinguish between uncommon and common foods and then randomly sample from these two groups equally for the Indirect Retrieval subsets. Further details of our commonness-based sampling process are presented in Appendix C.
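A minimal sketch of the commonness score is given below. It assumes cosine similarity as the similarity measure and uses tiny hand-written vectors in place of real “text-embedding-3-large” embeddings; the function names and the choice to treat a score exactly at the threshold as common are also assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def commonness_scores(embeddings):
    """Score each food by its highest similarity to any *other* food.

    This equals the second-highest entry of its similarity row, since the
    highest entry is the similarity of the item with itself.
    """
    return {
        name: max(cosine(vec, other)
                  for other_name, other in embeddings.items()
                  if other_name != name)
        for name, vec in embeddings.items()
    }

def split_by_commonness(scores, threshold=0.75):
    """Split food names into (common, uncommon) groups at the threshold."""
    common = [n for n, s in scores.items() if s >= threshold]
    uncommon = [n for n, s in scores.items() if s < threshold]
    return common, uncommon

# Toy 2-d vectors standing in for real food-name embeddings (hypothetical).
toy = {"apple": [1.0, 0.0], "green apple": [0.9, 0.1], "durian": [0.0, 1.0]}
common, uncommon = split_by_commonness(commonness_scores(toy))
```

Equal numbers of items would then be drawn at random from the two groups, for example with `random.sample`.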

3.5.2 Generating Natural Language based Meal Descriptions

GPT-3.5 Based Generation

We instruct GPT-3.5 to generate natural language meal descriptions (i.e., queries) from the sampled food items to create NutriBench. To encourage diversity, we prompt the LLM to produce five varied meal descriptions for each food item in a single generation, from which we randomly select one as the final query. When increasing the number of food items in a query, we instruct the LLM to combine two or three single-item queries into a combined meal description. Details of the prompts used can be found in the Appendix.

Two-Round Human Verification

Although GPT-3.5 can generate meal descriptions, it may occasionally produce outputs with incorrect food names or missing serving sizes. To address this, we conduct two rounds of human verification. First, a human evaluator reviews each generated meal description and manually corrects the food name and serving size where needed. In the second round, another evaluator re-examines the entire dataset. This ensures NutriBench contains high-quality natural language meal descriptions. Examples are displayed in Figure 1 and the Appendix.

4 Experimental Setup

4.1 LLM Models

In this work, we conduct a comprehensive evaluation of seven state-of-the-art large language models (LLMs) using our proposed NutriBench. The evaluation spans models of varying sizes, ranging from small-scale models with 7 billion parameters to large-scale models with 175 billion parameters. Additionally, we compare models with integrated general medical knowledge to those without such specialized information. The evaluated LLMs are introduced as follows:

  • GPT-3.5 [7]: GPT-3.5-Turbo from OpenAI is a closed-source model, accessible via an API.

  • Llama2-7B and Llama2-70B [38]: Llama-2-7B-chat and Llama-2-70B are open-source instruction-tuned models from Meta.

  • Llama3-8B and Llama3-70B [3]: We also evaluate the advanced Llama models, Llama-3-8B-Instruct and Llama-3-70B. Notably, Llama-3-70B competes with GPT-3.5 in human evaluations.

  • Alpaca-7B [33]: Alpaca-7B, developed by Stanford, fine-tunes Llama-7B with 52K instruction-following examples.

  • MedAlpaca-7B [17]: In comparison to Alpaca-7B, MedAlpaca-7B fine-tunes Llama with medical data including established medical NLP tasks as well as various internet resources.

4.2 Prompt Methods

In this section, we introduce how we adapt four existing prompting methods with carefully designed prompts tailored for carbohydrate estimation. They are:

  • Base: The first baseline involves instructing LLMs to estimate the carbohydrate content based on the meal description provided in the query with basic instructions.

  • Chain-of-Thought (CoT) [39]: Since our data includes complex queries with multiple items in varying quantities for a meal description, we hypothesize that the step-by-step reasoning induced by chain-of-thought prompting would reduce model errors by enabling the model to identify and reason about individual query components required to make the overall decision.

  • Retrieval-Augmented Generation (RAG) [21]: To further enhance the reliability of LLMs, we use RAG to ground their predictions with nutrition knowledge retrieved from Retri-DB. First, for a given meal query, we prompt the model to parse it into individual food components. Next, we retrieve nutrition information about each food item in the query through a nearest neighbor semantic similarity search and concatenate the results to form a comprehensive set of facts about the food components in the query. Finally, we provide this retrieved context to the LLMs along with the original prompts.

  • RAG+CoT: We combine the nutrition retrieval capability of RAG with step-by-step reasoning in CoT by concatenating the retrieved nutrition context with the CoT prompting for LLMs.

In all the cases mentioned above, we instruct the models to respond with ‘-1’ if they don’t know the answer to reduce the risk of potentially harmful predictions. Figure 1 shows different outputs of GPT-3.5 using the four different paradigms. We apply RAG and RAG+CoT to GPT-3.5, Llama3-8B, and Llama3-70B only, due to computational constraints. The prompts for each method are in the Appendix.
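The retrieval and abstention steps can be sketched as follows. This is an illustration under stated assumptions, not the system evaluated here: the query is assumed to be already parsed into food names, string similarity from `difflib` stands in for the nearest neighbor semantic similarity search over embeddings, the prompt layout is invented, and the model response is assumed to be a bare number. All function names and database contents are hypothetical.

```python
from difflib import SequenceMatcher

def retrieve(food, retri_db, k=1):
    """Return the facts of the k food names most similar to `food`.

    Assumption: character-level similarity replaces embedding-based search.
    """
    ranked = sorted(
        retri_db,
        key=lambda name: SequenceMatcher(None, food.lower(), name.lower()).ratio(),
        reverse=True,
    )
    return [fact for name in ranked[:k] for fact in retri_db[name]]

def build_rag_prompt(query, foods, retri_db):
    """Concatenate retrieved facts for every parsed food item with the query."""
    context = [fact for food in foods for fact in retrieve(food, retri_db)]
    return "Context:\n" + "\n".join(context) + "\n\nQuery: " + query

def parse_answer(text):
    """Return the estimate in grams, or None when the model abstains with -1."""
    value = float(text.strip())
    return None if value == -1 else value

# Hypothetical two-entry retrieval database.
retri_db = {
    "white rice": ["100g of white rice has 28g carbohydrates"],
    "apple": ["100g of apple has 14g carbohydrates"],
}
prompt = build_rag_prompt("I had a cup of rice for lunch.", ["rice"], retri_db)
```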

4.3 Evaluation Metrics

We calculate the mean absolute error (MAE) to measure the deviation of the model responses from the true carbohydrates. In addition, we report accuracy (Acc@7.5) by considering the model output as ‘correct’ if the predicted value is within ±7.5g of the ground truth. This is based on the insulin-to-carbohydrate ratio, which indicates the grams of carbohydrates one unit of insulin can cover. While this ratio varies among individuals, it is generally considered 1:15 as a rule of thumb (https://www.tidepool.org/blog/optimizing-insulin-to-carb-ratios). Since we aim to improve insulin management and avoid even half-unit insulin dosage errors, we maintain a conservative threshold of 7.5g on the absolute error to measure accuracy.

Finally, since we allow the models not to provide an estimate if uncertain, we also report the answer rate (AR), indicating the percentage of answered questions. Overall, the models should have a high AR and Acc@7.5, and a low MAE.
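The three metrics can be computed as in the sketch below. With a 1:15 ratio, one unit of insulin covers 15g of carbohydrates, so a half-unit error corresponds to 15 / 2 = 7.5g, which is the threshold used. The function name and example values are hypothetical, and computing MAE and accuracy over answered queries only is an assumption about the evaluation protocol.

```python
def evaluate(predictions, truths, threshold=7.5):
    """Compute answer rate (%), MAE (g), and Acc@threshold (%).

    predictions: floats in grams, with None for abstentions (model said -1).
    Assumption: MAE and accuracy are computed over answered queries only.
    """
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    errors = [abs(p - t) for p, t in answered]
    n = len(errors)
    return {
        "answer_rate": 100.0 * n / len(truths),
        "mae": sum(errors) / n if n else float("nan"),
        "acc": 100.0 * sum(e <= threshold for e in errors) / n if n else float("nan"),
    }

# Four hypothetical queries; the second one is left unanswered.
metrics = evaluate([30.0, None, 50.0, 12.0], [25.0, 40.0, 70.0, 10.0])
```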

5 Evaluating LLMs on NutriBench

In this section, we evaluate the performance of seven LLMs and four prompting methods on NutriBench, our nutrition benchmark based on natural language meal descriptions. We specifically focus on carbohydrate estimation due to its pivotal role in blood glucose management for diabetes. We anticipate that the insights derived from carbohydrate estimation will be applicable to other nutritional components, such as proteins, fats, and calories, included in NutriBench.

We begin by summarizing the general findings that hold across all 15 subsets in NutriBench. As shown in Figure 2, we find that:

  • Among seven LLMs, GPT-3.5 generally outperforms the others, with Llama3-70B ranking second.

  • There is a trade-off between accuracy and answer rate. For instance, GPT-3.5 with CoT achieves a 51.48% Acc@7.5, outperforming all other methods, but tends to be more conservative in carbohydrate estimation, resulting in a lower answer rate.

  • Increasing the model size of LLMs generally improves both accuracy and answer rate, as observed when comparing Llama models with 7 or 8 billion parameters to those with 70 billion.

  • The medical LLM, MedAlpaca, performs better than its non-medical counterpart, Alpaca, but does not surpass more advanced LLMs. We hypothesize that while medical data can help nutrition estimation, more advanced LLMs possess a stronger ability to comprehend natural language meal descriptions.

In the following sections, we compare performance on specific NutriBench subsets to analyze how different LLMs and prompt methods affect carbohydrate estimation.

5.1 CoT Improves both Answer Rate and Accuracy, Especially on Challenging Queries

In this section, we investigate how step-by-step reasoning induced by CoT helps LLMs in carbohydrate estimation. First, we find that CoT consistently improves both answer rate and accuracy across 15 subsets and 7 LLMs, as demonstrated in Table 3. We further analyze the queries answered by CoT but not by Base, and observe a higher MAE compared to queries answered by both, as shown in Table 4. This indicates that CoT enables LLMs to answer challenging queries, in line with the observations in [39] that step-by-step reasoning can effectively help LLMs tackle difficult problems.

Table 3: CoT improves both Answer Rate and Acc@7.5. Accuracy and answer rate are averaged over 15 subsets and 7 models for Base and CoT, and over 3 models (GPT-3.5, Llama3 8B, and Llama3 70B) for RAG and RAG+CoT.
               Base    CoT     RAG     CoT+RAG
Answer Rate    79.21   96.75   97.91   97.17
Acc@7.5        30.70   35.88   42.08   45.49

Table 4: CoT enables LLMs to answer challenging queries as the MAE of queries that CoT Answered Only is higher than those both answered. We report the MAE of GPT-3.5 where CoT significantly improves the answer rate.
                     Base    CoT
Both Answered        15.92   13.34
CoT Answered Only    NA      15.92

Error Analysis

Since the intermediate steps with chain-of-thought prompting add a layer of interpretability to the models’ reasoning process, we manually reviewed failure examples with high absolute errors. Approximately 80% of model errors come from erroneous carbohydrate predictions, possibly due to incorrect prior knowledge or hallucinations by the model. The remaining errors arise from misidentifying either individual food items or the serving size in a query. Notably, none of the errors were due to mathematical calculation mistakes. This indicates that CoT is sufficient to handle the mathematical complexity of nutrition estimation. Detailed error analysis is in the Appendix.

Figure 4: RAG only helps on queries with metric serving. We average the results across various food items and serving sizes. Direct retrieval does not improve Acc@7.5 compared to indirect retrieval, indicating that LLMs struggle to convert one serving description to another. However, when both the query and context contain metric servings, RAG improves accuracy.

5.2 RAG Only Helps with Aligned Serving Descriptions

In this section, we investigate how external nutrition knowledge retrieved by RAG aids LLMs in making predictions. We compare results for natural and metric servings in Figure 4. Surprisingly, despite direct retrieval providing nutrition information with exact food names for different serving descriptions, it still degrades performance. This suggests that LLMs struggle to convert one serving description to another, particularly from metric to natural serving, even when our designed prompt explicitly instructs the model to convert the servings. In contrast, RAG consistently improves both Acc@7.5 and AR for metric serving queries, even without directly retrieving the exact food names from Retri-DB. Considering most food items in Retri-DB only have metric serving descriptions, we conclude that RAG can provide LLMs with useful nutrition knowledge by retrieving similar, but not necessarily identical, food items, as long as the serving descriptions are aligned. Based on these findings, a promising future direction is to augment Retri-DB with more natural servings for RAG to support knowledge-grounded predictions with LLMs.

5.3 LLMs Excel in Multi-item Queries but Struggle with Multi-Serving Queries

Multi-Item Analysis

We compare the average model performance across the single, double, and triple food subsets with single servings in NutriBench. As shown in Table 5, the unnormalized MAE increases with the number of food items in a query as the task becomes more challenging. We also measure the normalized MAE corresponding to a single food item with a single serving. For each query in a multiple food subset, if all food items are answered in the corresponding single food subset, we average the MAE of each item from the single food subset. Additionally, we divide the unnormalized MAE by the number of food items in double or triple food subsets. Surprisingly, we observe the opposite trend: the models exhibit a lower normalized MAE over multi-item subsets compared to single-item subsets. We hypothesize that pairing foods together provides additional context for predictions. (To rule out an ensemble effect, we predict single foods multiple times and average the results, but we do not observe a difference compared to a single prediction.) This suggests that providing complete descriptions of meals, including all items in a single query, can be more accurate than prompting LLMs multiple times for each individual item.

Table 5: Multi-item subsets have lower normalized MAE, averaged over 7 LLMs and 4 methods.
                Single   Double   Single   Triple
Unnormalized    18.85    27.24    18.85    38.78
Normalized      18.81    13.65    19.06    13.18

Table 6: Multi-serving subsets have higher normalized MAE, averaged over 7 LLMs and 4 methods.
                Single   Multiple
Unnormalized    28.26    19.63
Normalized      28.29    34.62

Multi-Serving Analysis

We compare prediction errors across subsets containing single food items with either single or multiple serving sizes. In Table 6, we present both the unnormalized MAE corresponding to each meal description, and the normalized MAE, which divides the error of multiple serving queries by the serving multiplication factor. Although we observe a lower unnormalized MAE for multiple serving subsets, the normalized MAE is higher compared to single serving subsets. In our error analysis in the Appendix, we observed that the majority of errors for multiple serving subsets stem from inaccurate initial predictions rather than calculation mistakes. Based on this, we hypothesize that the higher normalized error observed for the multiple servings set is due to the uncommon serving sizes. Online nutrition databases, like FDC, typically provide nutrition estimates for standardized serving sizes, such as per 100g. Therefore, LLMs are likely trained on data featuring nutrition estimates for 100g servings. However, when tasked with carbohydrate estimation for variable serving sizes, LLMs may struggle to effectively use this prior knowledge.
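The serving-based normalization can be illustrated with a short sketch. The function names and the example records are hypothetical; the only element taken from the text is that each query's absolute error is divided by its serving multiplication factor before averaging.

```python
def unnormalized_mae(records):
    """records: list of (predicted_g, true_g, serving_factor) tuples."""
    return sum(abs(p - t) for p, t, _ in records) / len(records)

def normalized_mae(records):
    """Divide each absolute error by the serving factor, then average."""
    return sum(abs(p - t) / f for p, t, f in records) / len(records)

# Two hypothetical multi-serving queries: a double serving and a half serving.
records = [(60.0, 50.0, 2.0), (10.0, 12.5, 0.5)]
raw = unnormalized_mae(records)
per_serving = normalized_mae(records)
```

Here the half-serving query has a small raw error but the same per-serving error as the double-serving query, which is how a subset can show a lower unnormalized MAE and yet a higher normalized MAE.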

Figure 5: Food with high carbohydrates has large MAE.

5.4 High-carbohydrate Foods Lead to Large Prediction Error

We examined the relationship between the prediction error and the true carbohydrate content. Figure 5 shows the histogram of the true carbohydrate content for single-item, single-serving queries. For each bin, we measure and plot the average MAE (in red). We observe a positive correlation: the MAE increases as the ground truth carbohydrate content rises. This indicates that LLM predictions are likely to be more accurate for individuals with a generally low-carbohydrate diet compared to those with a high-carbohydrate diet.
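The per-bin averaging behind Figure 5 can be sketched as below. The bin edges and the data are assumptions chosen for illustration; the paper does not state the bin width it uses.

```python
def binned_mae(true_carbs, pred_carbs, bin_edges):
    """Count queries and average the absolute error within each ground-truth bin.

    Bin b covers bin_edges[b] <= true value < bin_edges[b + 1]; values outside
    all bins are ignored. Empty bins get an MAE of None.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for t, p in zip(true_carbs, pred_carbs):
        for b in range(n_bins):
            if bin_edges[b] <= t < bin_edges[b + 1]:
                sums[b] += abs(p - t)
                counts[b] += 1
                break
    maes = [s / c if c else None for s, c in zip(sums, counts)]
    return counts, maes


# Hypothetical ground-truth and predicted carbohydrates (grams).
counts, maes = binned_mae(
    true_carbs=[5.0, 15.0, 25.0, 45.0],
    pred_carbs=[6.0, 12.0, 30.0, 35.0],
    bin_edges=[0.0, 20.0, 40.0, 60.0],
)
print(counts, maes)
```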

5.5 LLMs Outperform the Nutritionist in Accuracy, Speed and Stress Reduction

We conduct a voluntary human study on carbohydrate estimation involving 10 non-expert laypersons and 1 nutritionist. Among the ten laypersons, we include 1 patient with Type-1 diabetes for over 10 years, individuals with a general understanding of nutrition (including calorie awareness but not carbohydrates), and others without any particular focus on nutrition knowledge. We randomly sample 6 meal descriptions from each subset to create a final test set of 90 queries. To make a fair comparison with LLMs, we explicitly instruct the participants: Do not search online or use nutrition apps for carbohydrates. Additionally, we provide all participants with three meal descriptions with corresponding carbohydrates, identical to the few-shot examples given in the LLM prompts.

The results show an interesting finding: the professional nutritionist could not outperform advanced LLMs in carbohydrate estimation, as illustrated in Figure 2 (LLMs outperform humans when evaluated on exactly the same 90 queries). In addition, the nutritionist took a total of 50 minutes to complete all 90 queries (this potentially includes time spent searching for information on uncommon foods), whereas LLMs can answer all 90 queries within minutes; for example, GPT-3.5 completed them in 2 minutes. Lastly, when the meal descriptions become more complicated, participants experience significantly heightened stress in processing the information. In contrast, there is no difference when LLMs process longer meal descriptions. Taken together, we conclude that LLMs exhibit significant potential in addressing this challenging yet critical task.

6 Conclusion

In this study, we presented NutriBench, the first publicly available benchmark for evaluating the performance of LLMs on nutrition estimation from natural language meal descriptions. NutriBench contains 15 distinct subsets with a total of 5,000 human-verified meal descriptions, representing various challenging scenarios likely to be encountered in the real world. We conducted 300 experiments to evaluate seven state-of-the-art LLMs on NutriBench and found that GPT-3.5 with CoT achieves the highest accuracy, significantly outperforming even human experts with professional nutritional knowledge. Our benchmark not only highlights the capabilities of existing LLMs but also provides a robust platform for future studies in this vital area. A limitation of our study is that the scale of the dataset may not be large enough, which we plan to address in future iterations. We hope the insightful findings in this work will inspire researchers to develop domain-specific LLMs for nutrition estimation, ultimately contributing to improved dietary choices and overall health outcomes.

7 Acknowledgements

We extend our sincere gratitude to Xuan Yang for the valuable initial discussions, data survey, and construction efforts for this project. Additionally, we thank Yifan Wei for human verification to ensure the quality of the benchmark. We are also grateful to Xuezhi Wang for the insightful discussion and interpretation of our results. Finally, we thank Andrew Koutnik for providing perspectives and nutrition estimates for the human study as a professional nutritionist, as well as our non-expert human study participants. Their collective contributions were essential to the success of our work.

References

  • [1] Foodb. https://foodb.ca/. Accessed: 2024-06-05.
  • [2] Fooddata central. https://fdc.nal.usda.gov/. Accessed: 2024-06-05.
  • [3] Introducing meta llama 3: The most capable openly available llm to date. https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-06-04.
  • [4] Nutritionix database. https://www.nutritionix.com/database. Accessed: 2024-06-05.
  • [5] Open food data. https://world.openfoodfacts.org/data. Accessed: 2024-06-05.
  • [6] Restaurant nutrition data. https://www.menustat.org/data.html. Accessed: 2024-06-05.
  • [7] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [8] Hadi Amiri, Andrew Beam, and Isaac S Kohane. Learning to estimate nutrition facts from food descriptions. In AMIA, 2019.
  • [9] Sina Buck, Collin Krauss, Delia Waldenmaier, Christina Liebing, Nina Jendrike, Josef Högel, Boris M Pfeiffer, Cornelia Haug, and Guido Freckmann. Evaluation of meal carbohydrate counting errors in patients with type 1 diabetes. Experimental and Clinical Endocrinology & Diabetes, 130(07):475–483, 2022.
  • [10] Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17754–17762, 2024.
  • [11] Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. Tabfact: A large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164, 2019.
  • [12] InformedHealth.org [Internet]. Cologne, Germany: Institute for Quality and Efficiency in Health Care (IQWiG); 2006-. Hyperglycemia and hypoglycemia in type 2 diabetes.
  • [13] Ivan Contreras, Marti Guso, Aleix Beneyto, and Josep Vehi. Photo-based carbohydrates counting using pre-trained transformer models. IFAC-PapersOnLine, 56(2):11533–11538, 2023.
  • [14] Naomi K Fukagawa, Kyle McKillop, Pamela R Pehrsson, Alanna Moshfegh, James Harnly, and John Finley. Usda’s fooddata central: what is it and why is it needed today? The American journal of clinical nutrition, 115(3):619–624, 2022.
  • [15] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2020.
  • [16] Ojas Gramopadhye, Saeel Sandeep Nachane, Prateek Chanda, Ganesh Ramakrishnan, Kshitij Sharad Jadhav, Yatin Nandwani, Dinesh Raghu, and Sachindra Joshi. Few shot chain-of-thought driven reasoning to prompt llms for open ended medical question answering. arXiv preprint arXiv:2403.04890, 2024.
  • [17] Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247, 2023.
  • [18] Niloofar Hezarjaribi, Sepideh Mazrouee, and Hassan Ghasemzadeh. Speech2health: a mobile framework for monitoring dietary composition from spoken data. IEEE journal of biomedical and health informatics, 22(1):252–264, 2017.
  • [19] Matthew Keller, Chi-en Amy Tai, Yuhao Chen, Pengcheng Xi, and Alexander Wong. Nutritionverse-direct: Exploring deep neural networks for multitask nutrition prediction from food images. arXiv preprint arXiv:2405.07814, 2024.
  • [20] Mandy Korpusik, Zachary Collins, and James Glass. Semantic mapping of natural language input to database entries via convolutional neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5685–5689. IEEE, 2017.
  • [21] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • [22] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, 2023.
  • [23] Shuyang Li. Food.com recipes and interactions, 2019.
  • [24] Y Liang and J Li. Computer vision-based food calorie estimation: Dataset, method, and experiment. arXiv preprint arXiv:1705.07632, 2017.
  • [25] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
  • [26] Peihua Ma, Yixin Wu, Ning Yu, Yang Zhang, Michael Backes, Qin Wang, and Cheng-I Wei. Vision-language models boost food composition compilation. arXiv preprint arXiv:2306.01747, 2023.
  • [27] Weiqing Min, Shuqiang Jiang, Linhu Liu, Yong Rui, and Ramesh Jain. A survey on food computing. ACM Computing Surveys (CSUR), 52(5):1–36, 2019.
  • [28] Saeejith Nair, Chi-en Amy Tai, Yuhao Chen, and Alexander Wong. Nutritionverse-synth: An open access synthetically generated 2d food scene dataset for dietary intake estimation. arXiv preprint arXiv:2312.06192, 2023.
  • [29] Megan E Rollo, Rebecca L Williams, Tracy Burrows, Sharon I Kirkpatrick, Tamara Bucher, and Clare E Collins. What are they really eating? a review on new approaches to dietary intake assessment and validation. Current nutrition reports, 5:307–314, 2016.
  • [30] Robin Ruede, Verena Heusser, Lukas Frank, Alina Roitberg, Monica Haurilet, and Rainer Stiefelhagen. Multi-task learning for calorie prediction on a novel large-scale recipe dataset enriched with nutritional information. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 4001–4008. IEEE, 2021.
  • [31] Sarvesh Sabarathinam. A glycemic diet improves the understanding of glycemic control in diabetes patients during their follow-up. Future Science OA, 9(3):FSO843, 2023.
  • [32] Chi-en Amy Tai, Saeejith Nair, Olivia Markham, Matthew Keller, Yifan Wu, Yuhao Chen, and Alexander Wong. Nutritionverse-real: An open access manually collected 2d food scene dataset for dietary intake estimation. arXiv preprint arXiv:2401.08598, 2023.
  • [33] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  • [34] Salima Taylor, Mandy Korpusik, Sai Das, Cheryl Gilhooly, Ryan Simpson, James Glass, and Susan Roberts. Use of natural spoken language with automated mapping of self-reported food intake to food composition data for low-burden real-time dietary assessment: method comparison study. Journal of Medical Internet Research, 23(12):e26988, 2021.
  • [35] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [36] Quin Thames, Arjun Karpur, Wade Norris, Fangting Xia, Liviu Panait, Tobias Weyand, and Jack Sim. Nutrition5k: Towards automatic nutritional understanding of generic food. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8903–8911, 2021.
  • [37] Martina Tosi, Davide Radice, Giulia Carioni, Teresa Vecchiati, Federica Fiori, Maria Parpinel, and Patrizia Gnagnarella. Accuracy of applications to monitor food intake: Evaluation by comparison with 3-d food diary. Nutrition, 84:111018, 2021.
  • [38] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [39] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • [40] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [41] Raza Yunus, Omar Arif, Hammad Afzal, Muhammad Faisal Amjad, Haider Abbas, Hira Noor Bokhari, Syeda Tazeen Haider, Nauman Zafar, and Raheel Nawaz. A framework to estimate the nutritional value of food in real time using deep learning techniques. IEEE Access, 7:2643–2652, 2018.
  • [42] Yiran Zhao, Jinghan Zhang, I Chern, Siyang Gao, Pengfei Liu, Junxian He, et al. Felm: Benchmarking factuality evaluation of large language models. Advances in Neural Information Processing Systems, 36, 2024.

Appendix A Data Distribution

A.1 Dataset Documentation and Intended uses.

We document the data using the Datasheets for Datasets framework [15] available at https://github.com/DongXzz/NutriBench/blob/main/NutriBench_Datasheet.pdf. The documentation is also attached at the end of the Appendix.

A.2 URLs to data and Croissant metadata.

A.3 Author statement.

We confirm that we bear all responsibility for any rights violations that may occur during dataset construction or other aspects of this work. We also confirm that the dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) license.

Appendix B More Examples of Generated Meal Descriptions

In this section, we give more examples of meal descriptions in NutriBench.

B.1 Natural Serving Direct Retrieval

  1. Enjoying a delicious cooked and broiled bison ribeye steak for lunch. (Single Food Item Single Serving)

  2. For dinner, I am having 2 servings of cooked, broiled bison ribeye steak. (Single Food Item Multiple Serving)

  3. Indulging in a pouch of caramel filled chocolate candy and using one tbsp dried rosemary to enhance the taste of my meal. (Double Food Item Single Serving)

  4. Treating myself to half regular size pouch of chocolate candy with caramel filling and 3 tbsp of dried rosemary. (Double Food Item Multiple Serving)

  5. Having a cup of great northern beans as my meal tonight, followed by a slice of dried mango for a tasty treat and a cookie filled with peanut butter and coated in chocolate. (Triple Food Item Single Serving)

B.2 Natural Serving Indirect Retrieval

  1. Having a refreshing Kamikaze cocktail. (Single Food Item Single Serving)

  2. I am having 2 Kamikaze drinks. (Single Food Item Multiple Serving)

  3. For breakfast, I am having a cup of Okara and a cup of reduced fat (2%) milk. (Double Food Item Single Serving)

  4. Okara is the star of my dish, and I am using 3 cups of it along with 3 cups of reduced fat (2%) milk as a mid-morning snack. (Double Food Item Multiple Serving)

  5. Treating myself to one cubic inch decadent creme brulee, served with a sprig of raw epazote in my soup and a cup of spoonbread for breakfast. (Triple Food Item Single Serving)

B.3 Metric Serving Indirect Retrieval

  1. Today’s dinner includes 100g of ground pork. (Single Food Item Single Serving)

  2. Savoring a 120g serving of ground pork. (Single Food Item Multiple Serving)

  3. Indulging in 100g Kerala mixture for a flavorful treat with 100g of CASEY’S GENERAL STORE gummi candy. (Double Food Item Single Serving)

  4. Enjoying a small portion of 45.0g Kerala mixture for a quick bite with savoring 90.0g of CASEY’S GENERAL STORE gummi candy as a sweet treat. (Double Food Item Multiple Serving)

  5. Indulging in 100g Kerala mixture, 100g of CASEY’S GENERAL STORE gummi candy, and a 100g portion of fresh caught fish for my meal. (Triple Food Item Single Serving)

Appendix C Details on Commonness-Based Sampling

Figure 6: Commonness-based sampling pipeline.
Figure 7: Histogram of commonness scores in the filtered FDC database.

As shown in Figure 6, we first use OpenAI’s ‘text-embedding-3-large’ model to extract embeddings for all the food names. Next, we calculate a similarity matrix over these embeddings and take the second largest value in each row as the commonness score (the largest value in a row is the food’s similarity to itself). A higher commonness score indicates that the database contains a more similar food. Finally, we perform random sampling: foods with a commonness score greater than 0.75 form the common subset, while those with a score less than 0.75 form the uncommon subset.

Compared to random sampling, commonness-based sampling ensures the inclusion of uncommon foods in NutriBench. As shown in Figure 7, almost 95% of the foods have a commonness score greater than 0.9. If we apply random sampling naively, most of the selected foods would be common, resulting in a lack of diversity in NutriBench.
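The commonness score can be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: cosine similarity is assumed as the similarity measure, the two-dimensional vectors stand in for real ‘text-embedding-3-large’ embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def commonness_scores(embeddings):
    """Second largest value in each row of the pairwise similarity matrix."""
    scores = []
    for u in embeddings:
        row = sorted((cosine(u, v) for v in embeddings), reverse=True)
        scores.append(row[1])  # row[0] is the self-similarity
    return scores


def split_by_commonness(names, scores, threshold=0.75):
    """Foods scoring above the threshold are common, those below are uncommon."""
    common = [n for n, s in zip(names, scores) if s > threshold]
    uncommon = [n for n, s in zip(names, scores) if s < threshold]
    return common, uncommon


# Toy stand-in embeddings (assumption): two similar foods and one outlier.
names = ["white rice", "brown rice", "epazote"]
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
scores = commonness_scores(embeddings)
common, uncommon = split_by_commonness(names, scores)
print(scores, common, uncommon)
```

Random sampling would then be applied within each of the two groups, so that uncommon foods are represented even though they are a small share of the database.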

Appendix D Details on Meal Description Generation

In this section, we provide details about the generation of meal descriptions, including the prompts used for creating these descriptions and human verification examples.

D.1 Prompts for Meal Description Generation

In Figure 8, we show the prompts used for generating meal descriptions for the single food item subset. After creating the single food item subset, we randomly combine two or three meal descriptions to form the double and triple food item subsets. The prompts for these combinations are shown in Figure 9.

Figure 8: Prompts for meal description generation.
Figure 9: Prompts for meal description combination.

D.2 Human Verification

After generating five meal descriptions with GPT-3.5 for each food item, we randomly choose one meal description as the final meal description. However, these descriptions still require refinement before practical application. Here are some examples of the raw generated descriptions and the descriptions after human verification:

Instruction Not Followed:
Before human verification: ULTRA PERFORMANCE FOOD RELEASE meal description 1: 100g ULTRA PERFORMANCE FOOD RELEASE.
After human verification: For a quick snack, I have 100g ULTRA PERFORMANCE FOOD RELEASE.

Missing Serving Size:
Before human verification: Using dried rosemary to enhance the taste of my meal.
After human verification: Using one tbsp dried rosemary to enhance the taste of my meal.

Before human verification: Indulging in a slice of dried mango for a tasty treat and having a delicious meal of braised select brisket.
After human verification: Indulging in a slice of dried mango for a tasty treat and having a delicious meal of 1 oz braised select brisket.

Before human verification: Enjoying some SHAMI KABAB ground chicken patties for dinner.
After human verification: Enjoying 100g SHAMI KABAB ground chicken patties for dinner.

Incorrect Serving Size:
Before human verification: Lunch today consists of a whole grain white bagel.
After human verification: For breakfast, I am enjoying half a piece of whole grain white bagel.

Before human verification: Fueling up with a bowl of Honey Bunches of Oats cereal.
After human verification: Enjoying 1.5 cup of Honey Bunches of Oats for breakfast.

Appendix E Prompts for Carbohydrate Estimation

For the Base and CoT methods, we query the LLM with the meal description and some instructions, as shown in Figure 10. For Llama-2 and Llama-3, we follow their special prompt format but keep the main content the same; Figure 11 shows the prompt for Llama-3 with Base and CoT. For RAG, we first parse the meal description into food components. The parsing prompt is shown in Figure 12. Next, we retrieve the nutrition information for each food item and finally provide the LLM with the retrieved context as well as the meal description. The RAG prompt for GPT-3.5/Alpaca-7B/Medalpaca-7B is shown in Figure 13.

Figure 10: Prompts for carbohydrate estimation using GPT-3.5/Alpaca-7B/Medalpaca-7B without RAG.
Figure 11: Prompts for carbohydrate estimation using Llama3-8B/70B without RAG.
Figure 12: Prompts for parsing meal description into food items.
Figure 13: Prompts for carbohydrate estimation using Llama3-8B/70B with RAG.

Appendix F Commonness Score and Prediction Error

Figure 14: There is not a strong correlation between the commonness score and the MAE.

In this section, we compare the commonness score and the prediction error. We use the predictions from the (Natural Serving, Indirect Retrieval) subset with a single food item and a single serving, averaged across the GPT-3.5, Llama3-8B, and Llama3-70B models using both the Base and CoT methods. In Figure 14, we average the MAE within each bin and observe no strong correlation between the commonness score and the prediction error.

Appendix G Comparing Results Across All Experiments

Figure 15: MAE obtained for all models across all NutriBench data splits and all prompting methods.

Figure 15 presents a comparison of the MAE obtained for all models across all 15 data splits of NutriBench. Our observations here reinforce the findings from Section 5.

Figure 15 shows that overall, GPT-3.5 has the lowest MAE across the data splits, followed by Llama3-70B. We also observe that LLMs with more parameters (e.g., Llama3-70B, Llama2-70B) generally have a lower MAE than their counterparts with fewer parameters (e.g., Llama3-8B, Llama2-7B). Further, the medical domain-tuned model, MedAlpaca, has a lower MAE compared to its general-domain counterpart, Alpaca, when averaged across the different data splits.

Subfigures (a)-(c) show the effect of increasing the number of food items from 1 to 3 while keeping the number of servings constant at 1 for the (Natural Serving, Direct Retrieval), (Natural Serving, Indirect Retrieval), and (Metric Serving, Indirect Retrieval) splits respectively. In general, we observe an increasing trend in the MAE across all models and methods with an increasing number of food items. However, as shown in Section 5.3, models exhibit a lower normalized MAE for the multi-item subsets compared to single-item subsets, indicating that providing complete meal descriptions by including all items in a single query can be more accurate than prompting LLMs for each item.

Subfigures (d)-(f) show the effect of increasing the number of servings from single to multiple, as well as simultaneously increasing the number of food items from 1 to 2 for the (Natural Serving, Direct Retrieval), (Natural Serving, Indirect Retrieval), and (Metric Serving, Indirect Retrieval) splits respectively. We observe that for the natural serving subsets, there is a general increasing trend of MAE with an increasing number of servings and items in the meal. However, the MAE decreases from single to multiple servings with a single food item in the metric serving split. We explore this phenomenon further in Section 5.3, where we observe a lower normalized MAE for serving quantities that are common and likely to be included in the training data for the LLMs.

Appendix H Normalized MAE Analysis

In Section 5.3, we find that multiple servings lead to higher normalized MAE compared to a single serving with the metric serving (100g). We hypothesize that 100g servings are more likely to appear in the training data for LLMs, resulting in better carbohydrate estimation for 100g servings. We further validate this hypothesis for natural servings. As shown in Table 7, the normalized errors for natural servings are similar between single serving and multiple servings. This similarity can be attributed to the commonality of both single servings (e.g., 1 cup) and multiple servings (e.g., 2 cups), making them more likely to be represented in the training data.

Table 7: MAE for natural servings across subsets, models, and methods.
               Single    Multiple
Unnormalized   14.81     26.24
Normalized     14.86     14.88

Appendix I Error Analysis

To gain a deeper understanding of the challenges of using large language models for nutrition estimation, we manually reviewed a sample of model outputs with a high error. For this analysis, we focused on the CoT and RAG+CoT methods, as the intermediate steps produced by chain-of-thought prompting add a layer of interpretability to the models’ reasoning process. To gain insights across various dimensions of query complexity, we randomly sampled 10 queries with an absolute error greater than 15 from predictions made by GPT-3.5 on each of the single food subsets (with single and multiple servings) and the double food subset (with single servings). Within these, we sampled queries from both the (Natural Serving, Direct Retrieval) and the (Metric Serving, Indirect Retrieval) subsets. Overall, we analyzed 120 erroneous model outputs.

Based on our analysis in the CoT setting, we categorized model errors into three main types:

  • Parsing Errors, which refer to mistakes in identifying individual food items in a query.

  • Serving Size Errors, which denote errors in determining the serving sizes of food items.

  • Incorrect Predictions, which involve erroneous carbohydrate predictions, possibly due to incorrect prior knowledge or hallucination by the model.

Figure 16 shows examples of each error type.

Figure 16: Examples of different types of errors identified in the error analysis. All model outputs are generated by GPT-3.5 with the CoT prompt.

Across all data subsets in this setting, 79.4% of errors were due to ‘Incorrect Predictions’, 14.7% resulted from ‘Serving Size Errors’, and 5.9% came from ‘Parsing Errors’. In the double food subsets, 75% of the sampled queries contained an error in only one of the food items, while the other item was estimated correctly. Notably, none of the errors were due to mathematical calculation mistakes, including in double food and multiple serving queries.

In the RAG+CoT setting, we introduce 3 additional error categories specific to retrieval-based generation:

  • Serving Unit Conversion Errors, which arise from different serving units in the retrieved context and the query.

  • Misdirected Attention Errors, which occur when the model focuses on incorrect context.

  • Misleading Context Errors, which occur when retrieved contexts closely resemble the query food item but have different carbohydrate values.

Figure 17 shows examples of the error types introduced for the RAG-based prompt.

Figure 17: Examples of different types of errors relevant to the RAG-based prompts. All model outputs are generated by GPT-3.5 with the RAG+CoT prompt. * Only the relevant retrieved context is shown in the figure for conciseness.

We split the analysis for the direct and indirect retrieval subsets. With the direct retrieval, natural serving set, 74.28% of errors were ‘Serving Unit Conversion Errors’. These errors occurred because most retrieved contexts contained carbohydrate values for similar food items per 100g, leading the model to use this value instead of converting to the provided serving amount and unit. ‘Misdirected Attention Errors’ accounted for 11.42% of errors, where the model returned carbohydrate values from irrelevant items in the retrieved context. In rare cases, the model combined carbohydrate values from different items in the context instead of focusing on the query. ‘Serving Size Errors’ accounted for 5.7% of errors, while ‘Parsing Errors’, ‘Incorrect Predictions’, and ‘Misleading Context’ comprised 2.86% of the errors each.

In contrast, in the indirect retrieval set, 35.90% of errors were due to ‘Misleading Context’, 23.08% were ‘Misdirected Attention Errors’, 17.95% were ‘Serving Size Errors’, 15.38% were ‘Parsing Errors’, and ‘Hallucination’ and ‘Unit Conversion’ errors accounted for 5.12% and 2.56% of the total errors, respectively.

Overall, the predictions exhibited a strong bias to the provided context, often deriving answers directly from the context even when it was irrelevant or required further processing.

I.0.1 Critically High Error Examination

Figure 18: Examples of model outputs with critically high error. The first three rows present cases of carbohydrate overestimation, while the bottom three rows present underestimation. All outputs are generated by GPT-3.5 using the CoT prompt.

We perform a qualitative analysis of queries and model outputs leading to exceptionally high errors. We restrict this analysis to GPT-3.5 with the CoT prompt as we want to monitor natural model failure cases, without the model being influenced by possibly erroneous context in the RAG setting.

Across all data subsets in NutriBench, the model had a mean (std) absolute error of 12.68 (21.19) g for the queries over which it made a prediction. To analyze the critically high error cases, we examine the subset of model predictions with an absolute error exceeding two standard deviations from the mean (i.e., absolute errors of 55g or more).
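The 55g cutoff follows from the reported mean and standard deviation, as the short sketch below shows. The list of absolute errors is hypothetical and only illustrates the filtering step.

```python
MEAN_ABS_ERROR = 12.68  # grams, as reported above
STD_ABS_ERROR = 21.19   # grams, as reported above

# Two standard deviations above the mean: 12.68 + 2 * 21.19 = 55.06 g.
threshold = MEAN_ABS_ERROR + 2 * STD_ABS_ERROR


def critically_high(abs_errors, cutoff):
    """Keep only the absolute errors that exceed the cutoff."""
    return [e for e in abs_errors if e > cutoff]


# Hypothetical per-query absolute errors (grams).
selected = critically_high([10.0, 55.0, 60.0, 120.5], threshold)
print(round(threshold, 2), selected)
```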

Our first observation is that for this subset, the average true carbohydrates in the queries is 118.66g, indicating that queries with critically high errors typically include high-carb meals. This finding is also consistent with our observations in Section 5.4. Further, 76.23% of these were underestimation errors, where the predicted carbohydrates were significantly lower than the true carbohydrates.

We also qualitatively analyze these samples, separating our analysis into cases of overestimation and underestimation, since incorrect insulin doses from either case can lead to hypoglycemia or hyperglycemia, both serious problems that require separate handling.

The first three rows of Figure 18 show examples of critical overestimation by the model. The first example is from the natural unit, multiple serving, single food subset. We observe that the model accurately predicted the amount of carbs in 3 oz of the food item (corresponding ground truth: 26.9g). However, it incorrectly assumed that 1 strip is a 3 oz serving whereas it is actually around 2.1 g (0.074 oz). This shows that the model has knowledge about the carbohydrates in the item, but is unable to generalize it to the serving amount in the query.

In the second example from the natural unit, single serving, single food subset, the model accurately predicted the carbohydrates for the slice of pumpkin bread. However, it significantly overestimated the carbohydrates in the pizza. This may be due to the model not taking the specific variation of the pizza (cheese topping, thin crispy crust) into consideration when making a prediction. For instance, the food item DIGIORNO Pizza with cheese topping and rising crust has 232g of carbohydrates per pie, which is closer to the model’s estimate.

In the third example from the metric unit, single serving, single food subset, the model significantly overestimates the carbs in the meal as it does not recognize that the only food item in the query is candy corn and the other items represent the flavors.

The bottom three rows in Figure 18 show examples of critical underestimation. In the first example from the natural unit, single serving, single food subset, we observe that the model prediction aligns more closely with the true carbohydrate content for 100g of the item (44.38g) rather than for 1 cup. In the second example from the natural unit, single serving, double food subset, we found that the FDC website includes the nutrition estimates for 0.25 cups of the food item, containing 42.1g carbohydrates. In both these cases, it is possible that the model saw the nutrition estimates for different serving quantities in the training data, but was unable to generalize to the query. In the third example from the metric unit, single serving, double food subset, the model seems to hallucinate in its predictions for both food items. As both are uncommon food items, we hypothesize that the model made uncertain predictions instead of refraining from answering.

Across the critically high error cases, a common pattern is the model possessing internal knowledge of the nutrition estimates for a different serving amount, but struggling to generalize to the amount in the query. This phenomenon was also observed in the previous section with RAG, where a significant portion of the errors were ‘Serving Unit Conversion Errors’, arising when the model could not convert the provided external knowledge to be consistent with the query. These observations reveal a significant limitation of LLMs, which we plan to address in our future work.