An Empirical Study of Validating Synthetic Data for Formula Generation

Usneek Singh
Aditya Kanade
Anirudh Khatry
Microsoft
Bangalore, India
&José Cambronero
Sumit Gulwani
Vu Le
Mukul Singh
Microsoft
Redmond, US
&Gust Verbruggen
Microsoft
Keerbergen, Belgium
Abstract

Large language models (LLMs) can be leveraged to help with writing formulas in spreadsheets, but resources on these formulas are scarce, impacting both the base performance of pre-trained models and limiting the ability to fine-tune them. Given a corpus of formulas, we can use a(nother) model to generate synthetic natural language utterances for fine-tuning. However, it is important to validate whether the NL generated by the LLM is indeed accurate to be beneficial for fine-tuning. In this paper, we provide empirical results on the impact of validating these synthetic training examples with surrogate objectives that evaluate the accuracy of the synthetic annotations. We demonstrate that validation improves performance over raw data across four models (2 open and 2 closed weight). Interestingly, we show that although validation tends to prune more challenging examples, it increases the complexity of problems that models can solve after being fine-tuned on validated data.

An Empirical Study of Validating Synthetic Data for Formula Generation


Usneek Singh Aditya Kanade Anirudh Khatry Microsoft Bangalore, India                        José Cambronero Sumit Gulwani Vu Le Mukul Singh Microsoft Redmond, US                        Gust Verbruggen Microsoft Keerbergen, Belgium


1 Introduction

Derived-column formulas in spreadsheets generate a new column by transforming existing columns in a table, and they have been shown challenging to write Gulwani et al. (2012). To aid users in writing such formulas, we can ask for a description in natural language Zhao et al. (2024). Unfortunately, since such formulas are sparse, therefore pre-trained language models (especially smaller) struggle in generating them without fine-tuning (0.03 for phi-2 in pass@10).

To construct a dataset for fine-tuning, public spreadsheet workbooks can be used but they contain only tables and formulas, whereas a fine-tuning dataset also requires paired natural language (NL) descriptions corresponding to each (Table, Formula). Traditionally datasets for NL-to-code tasks have been manually annotated Zhou et al. (2024); Austin et al. (2021). This is a time-consuming and expensive process. Leveraging LLMs, known for their text generation capabilities, is a viable alternative Tan et al. (2024) given that the synthetic NL generated by LLMs is accurate, as recent studies have shown that quality is more important than quantity Zhou et al. (2024); Li et al. (2023).

In this paper, we leverage LLMs to predict the accuracy of synthetic NL using 3 surrogate objectives, and show empirical results of fine-tuning models on subsets of synthetic data that are accepted by these objectives. Fine-tuning models on validated subsets shows better performance in predicting formulas compared to using raw data. For example, GPT-4 fine-tuned on data validated by alternative code generation objective saw up to a 25% improvement in evaluation scores along with a 23% reduction in training time. Additionally, we observe that the models fine-tuned on validated data perform better on more complex problems. Further, we release the synthetic dataset to seed future research in this area.

Our key contributions are as follows.

  • We define three surrogate objectives (output prediction, alternative code generation, and classification) to predict accuracy of synthetic natural language in the NL-to-Formula task.

  • We empirically analyze the effect of validating synthetic data using these objectives on fine-tuning performance of different models.

2 Related work

Formula generation

FlashFill Gulwani (2011); Gulwani et al. (2012) generates derived-column formulas by example, as users struggle with this task. SpreadsheetCoder Chen et al. (2021b) suggests formulas from surrounding context in spreadsheets. flame Joshi et al. (2024) is a small language model that understands formulas for tasks like repair and retrieval, but does not handle natural language. The NL-to-Formula (NL2F) task is introduced with a dataset obtained by converting the Text2SQL dataset to spreadsheet formulas Zhao et al. (2024). Unlike Zhao et al. (2024), our work centres on empirically evaluating different NL validation strategies.

LLMs for synthetic data

Tan et al. (2024) discusses the applications of LLMs in data annotation for classification tasks. Goel et al. (2023) demonstrates the use of LLMs in the medical domain, where they assist in labeling data with expert verification. Wang et al. (2024) and Kim et al. (2024) explore human-LLM collaborative approaches for annotation and verification. There has been no comparison of NL validation techniques on synthetic NL for NL2F.

Data quality for LLM fine-tuning

Chen and Mueller (2024) proposed an approach for automated filtering and verification of datasets to ensure high quality for LLM fine-tuning, leveraging the BSDetector Chen and Mueller (2023) to obtain confidence scores from LLM outputs. These techniques require existing ground truth labels (utterances) which are not available in our case. Zhou et al. (2024) and Li et al. (2023) manually curate data to demonstrate that instruction tuning with a small (< 1000) set of high-quality examples yields competitive results. While their work focuses on selecting examples based on alignment (already assuming correctness), our work evaluates technique-based selection on accuracy of NL instructions.

3 Validating synthetic data

Let T=[Ci]1n𝑇superscriptsubscriptdelimited-[]subscript𝐶𝑖1𝑛T=[C_{i}]_{1}^{n}italic_T = [ italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT be a table with n𝑛nitalic_n columns uniquely identified by hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. A derived-column formula F𝐹Fitalic_F is a formula where each leaf node is either a constant value or a column identifier hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Let U𝑈Uitalic_U be an utterance in natural language that describes how to derive a column from T𝑇Titalic_T. An derived-column task is specified by (U,T,F)𝑈𝑇𝐹(U,T,F)( italic_U , italic_T , italic_F ). Given U𝑈Uitalic_U and T𝑇Titalic_T the goal is to find a formula Fsuperscript𝐹F^{\prime}italic_F start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT such that F(T)F(T)superscript𝐹𝑇𝐹𝑇F^{\prime}(T)\equiv F(T)italic_F start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_T ) ≡ italic_F ( italic_T ).

To fine-tune a model, we therefore need examples of the form (U,T,F)𝑈𝑇𝐹(U,T,F)( italic_U , italic_T , italic_F ). T𝑇Titalic_T and F𝐹Fitalic_F can be mined from large spreadsheet corpora Singh et al. (2023); Joshi et al. (2024) and we can use an LLM to generate an utterance U^=LLM(T,F)^𝑈𝐿𝐿𝑀𝑇𝐹\hat{U}=LLM(T,F)over^ start_ARG italic_U end_ARG = italic_L italic_L italic_M ( italic_T , italic_F ).

A validator V(U^,T,F)𝔹𝑉^𝑈𝑇𝐹𝔹V(\hat{U},T,F)\rightarrow\mathbb{B}italic_V ( over^ start_ARG italic_U end_ARG , italic_T , italic_F ) → blackboard_B is a function that predicts whether U^^𝑈\hat{U}over^ start_ARG italic_U end_ARG accurately describes the formula F𝐹Fitalic_F operating on table T𝑇Titalic_T. These validators can be defined in any way—even using human annotators. To reduce manual effort, we define three validators using an LLM. An overview of these three validators is shown in Figure 1.

Refer to caption
Figure 1: Overview of different validators implemented on top of GPT-4 represented by (a) 𝑽𝑶subscript𝑽𝑶V_{O}bold_italic_V start_POSTSUBSCRIPT bold_italic_O end_POSTSUBSCRIPT: This validator directly computes F(T)𝐹𝑇F(T)italic_F ( italic_T ) from (U^,T)^𝑈𝑇(\hat{U},T)( over^ start_ARG italic_U end_ARG , italic_T ); (b) 𝑽𝑷subscript𝑽𝑷V_{P}bold_italic_V start_POSTSUBSCRIPT bold_italic_P end_POSTSUBSCRIPT: Validator predicts python program P𝑃Pitalic_P from (U^,T)^𝑈𝑇(\hat{U},T)( over^ start_ARG italic_U end_ARG , italic_T ) to compare P(T)𝑃𝑇P(T)italic_P ( italic_T ) with F(T)𝐹𝑇F(T)italic_F ( italic_T ); (c) 𝑽𝑪subscript𝑽𝑪V_{C}bold_italic_V start_POSTSUBSCRIPT bold_italic_C end_POSTSUBSCRIPT: Validator directly classifies U^^𝑈\hat{U}over^ start_ARG italic_U end_ARG based on input (U^,T,F)^𝑈𝑇𝐹(\hat{U},T,F)( over^ start_ARG italic_U end_ARG , italic_T , italic_F ).
Output prediction (VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT)

This validator asks the LLM to directly predict the output values F(T)𝐹𝑇F(T)italic_F ( italic_T ) from (U^,T)^𝑈𝑇(\hat{U},T)( over^ start_ARG italic_U end_ARG , italic_T ) and uses an element-wise row comparison to evaluate correctness. For numbers, we allow an absolute difference of 0.05. For strings, we use a longest common sub-sequence ratio of 0.8 as passing criterion. This approach leverages natural language to emulate the computation directly. It is inspired from the alternate task of output prediction discussed in Khatry et al. (2023)

Alternate code generation (VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT)

This validator asks the LLM to predict a program P𝑃Pitalic_P in another language (we use Python) from (U^,T)^𝑈𝑇(\hat{U},T)( over^ start_ARG italic_U end_ARG , italic_T ) and compares P(T)𝑃𝑇P(T)italic_P ( italic_T ) (execution of P on T𝑇Titalic_T) with F(T)𝐹𝑇F(T)italic_F ( italic_T ) using element-wise comparison with the same relaxations for strings and numbers. This leverages the abilities of LLMs to generate popular programming languages Ni et al. (2023).

Classification (VCsubscript𝑉𝐶V_{C}italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT)

This validator directly asks the model to classify whether U^^𝑈\hat{U}over^ start_ARG italic_U end_ARG accurately describes F𝐹Fitalic_F over T𝑇Titalic_T. It is based on the self-reflection certainty objective from BSDetector Chen and Mueller (2023).

4 Experimental setup

We describe training data and models, and the testing benchmark.

Training data

We mine (T,F)𝑇𝐹(T,F)( italic_T , italic_F ) pairs that satisfy our derived-column definition from publicly available Excel workbooks Singh et al. (2023). We create a training set and validation set of size 7833 and 422 respectively. Each (T,F)𝑇𝐹(T,F)( italic_T , italic_F ) pair is annotated with an utterance U^^𝑈\hat{U}over^ start_ARG italic_U end_ARG using GPT-4.

Models

We use two open (Phi-2 and Mistral-7b-instruct) and two closed-weight (GPT-35-turbo and GPT-4) models. Phi-2 (8 ×\times× V100) and Mistral (1 ×\times× A100) were fine-tuned for 10 and 15 epochs respectively. We selected the best checkpoint using validation loss. GPT-35-turbo (16161616 ×\times× A100) and GPT-4 (24242424 ×\times× A100) were fine-tuned using the Azure API. Mistral, GPT-35-turbo, GPT-4 were fine-tuned using LoRA Hu et al. (2021).

Testing data

The SofSet dataset Barke et al. (2024) consists of 201 spreadsheet formula tasks from StackOverflow. Of these, we filter the 139 tasks that satisfy our derived-column definition.

Metric

We use the pass@k𝑘kitalic_k metric Chen et al. (2021a) based on execution match of formula, were k𝑘kitalic_k represents the number of predictions considered out of the total number of predictions provided. In our evaluation system, we generate n=10𝑛10n=10italic_n = 10 predictions at temperature 0.60.60.60.6 and consider pass@1, pass@3 and pass@10.

5 Results and Discussion

We perform experiments to empirically explore the following research questions.

  • RQ1

    How do different validators compare?

  • RQ2

    What is the impact of validating data on fine-tuning performance?

  • RQ3

    What are the differences in cases solved by models trained on validated NL and raw dataset?

5.1 RQ1: Comparing validators

We apply our three validation approaches to our initial set of 7833 points. This produces the data subsets described in Table 1. We shows properties of the formulas accepted by each validator. Since VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT is bottle-necked on numerical operations, it succeeds for fewer unique functions and operators. Similarly, VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT struggles with more functions than VCsubscript𝑉𝐶V_{C}italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT as there might not be an easy Python equivalent.

Table 1: Summary of training data subsets with different validation approaches. "# functions" refers to unique functions, "# calls" to average function calls, "depth" to function nesting level, and "# ops" to average arithmetic operator count in formulas.
V𝑉Vitalic_V Size # functions # calls depth # ops
\varnothing 7833 122 1.03 0.87 1.28
VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT 2266 71 0.71 0.65 1.01
VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT 4095 95 0.86 0.77 1.22
VCsubscript𝑉𝐶V_{C}italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT 5246 109 0.87 0.79 1.24

Figure 2 shows overlap in examples accepted by different validators. Each validator uniquely accepts at least some examples. 1403 (18%) examples does not pass any validator.

Refer to caption
Figure 2: Summary of overlap of different data subsets produced by different validation strategies.
Table 2: Performance comparison of the different models on SofSet Benchmark using pass@1, pass@3 and pass@10 metric. Three out of the four models give best performance when fine-tuned on data validated by VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT.
Phi-2 Mistral GPT-35-turbo GPT-4
FT on # ex P@1 P@3 P@10 P@1 P@3 P@10 P@1 P@3 P@10 P@1 P@3 P@10
Base 0 0.01 0.01 0.03 0.01 0.03 0.05 0.19 0.27 0.35 0.28 0.34 0.4
Raw 7833 0.05 0.07 0.10 0.09 0.14 0.20 0.22 0.28 0.32 0.25 0.30 0.36
VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT 2266 0.03 0.05 0.07 0.11 0.15 0.19 0.21 0.27 0.29 0.27 0.32 0.37
VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT 4095 0.06 0.08 0.12 0.08 0.12 0.16 0.23 0.28 0.33 0.31 0.38 0.45
VCsubscript𝑉𝐶V_{C}italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT 5246 0.06 0.08 0.10 0.09 0.13 0.16 0.23 0.28 0.33 0.31 0.37 0.44
Table 3: Pairwise comparison of performance of sub-sampled (\subset) data from validated (V𝑉Vitalic_V) against rejected (¬V𝑉\neg V¬ italic_V) examples for different validation strategies on different models. The performance of VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT is affected due to its poor handling of functions with numerical values.
Phi-2 Mistral GPT-35-turbo GPT-4
FT on # ex P@1 P@3 P@10 P@1 P@3 P@10 P@1 P@3 P@10 P@1 P@3 P@10
VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT 2266 0.03 0.05 0.07 0.11 0.15 0.19 0.21 0.27 0.29 0.25 0.30 0.36
¬VOsubscript𝑉𝑂\neg V_{O}¬ italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT 5567 0.03 0.04 0.06 0.10 0.14 0.18 0.21 0.26 0.29 0.26 0.33 0.37
VCabsentsubscript𝑉𝐶\subset{V_{C}}⊂ italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT 2266 0.04 0.06 0.08 0.08 0.12 0.17 0.22 0.25 0.29 0.31 0.34 0.37
¬VCsubscript𝑉𝐶\neg V_{C}¬ italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT 2587 0.02 0.03 0.05 0.07 0.10 0.15 0.21 0.24 0.28 0.28 0.32 0.36
VPabsentsubscript𝑉𝑃\subset V_{P}⊂ italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT 2266 0.05 0.07 0.10 0.06 0.10 0.15 0.26 0.29 0.32 0.33 0.36 0.39
¬VPsubscript𝑉𝑃\neg V_{P}¬ italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT 3738 0.03 0.04 0.07 0.06 0.10 0.13 0.18 0.23 0.27 0.27 0.31 0.34
Table 4: Training time for different models on data subsets. Models fine-tuned on VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT and VCsubscript𝑉𝐶V_{C}italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT subsets require less time than on raw data while delivering better downstream performance.
Data Phi-2 Mistral GPT-35-turbo GPT-4
Raw 15h44m 8h51m 4h45m 14h00m
VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT 4h17m 2h32m 1h54m 7h25m
VPsubscript𝑉𝑃V_{P}italic_V start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT 8h08m 4h50m 3h00m 10h50m
VCsubscript𝑉𝐶V_{C}italic_V start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT 10h3m 5h59m 3h50m 11h19m

5.1.1 Attention weights comparison

We understand the alignment of NL with formula using average attention weights of the formula tokens originating from the NL tokens in the last layer of two open-source models (Phi-2, Mistral).
Let αijsubscript𝛼𝑖𝑗\alpha_{ij}italic_α start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT denote the attention weight from the i𝑖iitalic_i-th NL token to the j𝑗jitalic_j-th formula token in the last layer. The average attention weight α¯jsubscript¯𝛼𝑗\overline{\alpha}_{j}over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT for the j𝑗jitalic_j-th formula token from all NL token is given by:

α¯j=1NNLi=1NNLαijsubscript¯𝛼𝑗1subscript𝑁NLsuperscriptsubscript𝑖1subscript𝑁NLsubscript𝛼𝑖𝑗\overline{\alpha}_{j}=\frac{1}{N_{\text{NL}}}\sum_{i=1}^{N_{\text{NL}}}\alpha_% {ij}over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT NL end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT NL end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT (1)

The overall average attention weight for the entire formula α¯totalsubscript¯𝛼total\overline{\alpha}_{\text{total}}over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT total end_POSTSUBSCRIPT is given by:

α¯total=1NFormulaj=1NFormulaα¯jsubscript¯𝛼total1subscript𝑁Formulasuperscriptsubscript𝑗1subscript𝑁Formulasubscript¯𝛼𝑗\overline{\alpha}_{\text{total}}=\frac{1}{N_{\text{Formula}}}\sum_{j=1}^{N_{% \text{Formula}}}\overline{\alpha}_{j}over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT total end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT Formula end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT Formula end_POSTSUBSCRIPT end_POSTSUPERSCRIPT over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (2)
Refer to caption
Figure 3: Attention weight scores (normalised) for different data subsets. V1= Direct Computation, V2=Code Generation & Execution and V3= Classification. The average scores are higher for subsets with validated NL

By calculating α¯totalsubscript¯𝛼total\overline{\alpha}_{\text{total}}over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT total end_POSTSUBSCRIPT, we gain a measure of the interaction between NL tokens and formula tokens, highlighting the overall influence of natural language on the generation of formula tokens within the base model Yuksekgonul et al. (2023). We compare α¯totalsubscript¯𝛼total\overline{\alpha}_{\text{total}}over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT total end_POSTSUBSCRIPT across different data subsets. We observe that the average α¯totalsubscript¯𝛼total\overline{\alpha}_{\text{total}}over¯ start_ARG italic_α end_ARG start_POSTSUBSCRIPT total end_POSTSUBSCRIPT is highest for subset with NL validated across all techniques, followed by subsets with validated NL from each technique,, then the raw data, and finally, with data subsets where NL is rejected by each technique (refer Figure 3). This analysis suggests that validated NL provides a more relevant context for generating formula tokens.

5.2 RQ2: Effect on fine-tuning performance

We compare the impact of validated data versus raw (unvalidated) data, as well as the impact of validated data versus rejected cases by each validator, on the downstream performance of the NL2F task.

Versus raw

Table 2 shows base model (few-shot) and fine-tuning performance on different subsets of data. For the smaller models, Phi-2 and Mistral, the performance increase with fine-tuning is more significant. Except for Mistral in pass@10 and GPT-35-turbo in pass@3, a smaller, validated dataset yields better performance than raw data. 𝑽𝑷subscript𝑽𝑷V_{P}bold_italic_V start_POSTSUBSCRIPT bold_italic_P end_POSTSUBSCRIPT yields the best performance on average with nearly half the size of raw data. GPT-4 improves only when fine-tuned on validated data. Surprisingly, GPT-35-turbo without fine-tuning outperforms the fine-tuned version, likely due to differences in data distribution between training and testing benchmarks. Besides performance, fine-tuning with validated data also reduces training time significantly, as shown in Table 4.

Versus invalidated

Table 3 compares the performance of fine-tuning on the accepted (& subsampled) and rejected (¬\neg¬) examples for each validator. We sub-sample the accepted sets to 2266—the number of examples in the smallest set (VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT). Results of pairs (V,¬V)absent𝑉𝑉(\subset V,\neg V)( ⊂ italic_V , ¬ italic_V ) are marked in green if (V>¬V)absent𝑉𝑉(\subset V>\neg V)( ⊂ italic_V > ¬ italic_V ), blue if (V=¬V)absent𝑉𝑉(\subset V=\neg V)( ⊂ italic_V = ¬ italic_V ) and red if (V<¬V)absent𝑉𝑉(\subset V<\neg V)( ⊂ italic_V < ¬ italic_V ). We observe that, despite the smaller size of the validated data subset (subsampled), it outperforms its larger invalidated (rejected) counterpart in most (28/36) comparisons. The only decrease in performance happens for VOsubscript𝑉𝑂V_{O}italic_V start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT on GPT-4, likely due to the many functions (51) that were eliminated from the training data.

5.3 RQ3: Analysing solved cases

Figure 4 shows properties of the solved cases (where pass@10 === 1) after fine-tuning different models on raw data and validated subsets. We see that fine-tuning on datasets with fewer unique functions still enables all models (except for Mistral) to use more functions.

The average function call count increases for validated subsets compared to the raw data, indicating more complex formulas are solved by models fine-tuned on validated data. For GPT-4 and GPT-35-turbo, average operator count also increases with fine-tuning on validated data.

Refer to caption
Figure 4: Comparison of correctly solved cases on models fine-tuned with different validation subsets based on (a) Number of unique functions (b) Average number of function calls (c) Average operator count of formulas

6 Conclusion

We empirically evaluate the effect of automated validation of synthetic data using LLMs on the fine-tuning performance of derived-column NL-to-formula. We validate synthetic NL annotations with three surrogate tasks (classification, code generation in Python, and output prediction) and fine-tune different models on the examples accepted by each of these methods. In general, fine-tuning on smaller, validated datasets improves performance. Despite validation resulting in datasets with simpler formulas, that does not cause the fine-tuned models to only solve simpler problems. Further, we release our dataset to seed research in this area.

7 Limitations

Although we have focused on validating the correctness of natural language instructions, we have not addressed techniques for correcting them. Exploring methods for correcting instructions could be beneficial, as it would prevent the loss of data points. While having a smaller set of high-quality data can be advantageous for efficient training, achieving the best results may require maintaining a larger dataset by correcting invalid instructions.

In our study, the distribution of training data for fine-tuning is different than the testing data, which might not fully reflect the potential of fine-tuning. Additionally, our research has concentrated on formulas that expect a single, well-structured (formatted) input table. We aim to extend our work to include formulas that involve multiple tables and unstructured input. Furthermore, we have explored the potential of our technique in one language (English). We believe it will be valuable to investigate multilingual systems for validation setups.

References

  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
  • Barke et al. (2024) Shraddha Barke, Christian Poelitz, Carina Suzana Negreanu, Benjamin Zorn, José Cambronero, Andrew D Gordon, Vu Le, Elnaz Nouri, Nadia Polikarpova, Advait Sarkar, et al. 2024. Solving data-centric tasks using large language models. arXiv preprint arXiv:2402.11734.
  • Chen and Mueller (2023) Jiuhai Chen and Jonas Mueller. 2023. Quantifying uncertainty in answers from any language model and enhancing their trustworthiness.
  • Chen and Mueller (2024) Jiuhai Chen and Jonas Mueller. 2024. Automated data curation for robust language model fine-tuning. arXiv preprint arXiv:2403.12776.
  • Chen et al. (2021a) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021a. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  • Chen et al. (2021b) Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, and Denny Zhou. 2021b. Spreadsheetcoder: Formula prediction from semi-structured context. In International Conference on Machine Learning, pages 1661–1672. PMLR.
  • Goel et al. (2023) Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xiaohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, et al. 2023. Llms accelerate annotation for medical information extraction. In Machine Learning for Health (ML4H), pages 82–100. PMLR.
  • Gulwani (2011) Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. ACM Sigplan Notices, 46(1):317–330.
  • Gulwani et al. (2012) Sumit Gulwani, William R Harris, and Rishabh Singh. 2012. Spreadsheet data manipulation using examples. Communications of the ACM, 55(8):97–105.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Joshi et al. (2024) Harshit Joshi, Abishai Ebenezer, José Cambronero Sanchez, Sumit Gulwani, Aditya Kanade, Vu Le, Ivan Radiček, and Gust Verbruggen. 2024. Flame: A small language model for spreadsheet formulas. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12995–13003.
  • Khatry et al. (2023) Anirudh Khatry, Joyce Cahoon, Jordan Henkel, Shaleen Deep, Venkatesh Emani, Avrilia Floratou, Sumit Gulwani, Vu Le, Mohammad Raza, Sherry Shi, Mukul Singh, and Ashish Tiwari. 2023. From words to code: Harnessing data for program synthesis from natural language. Preprint, arXiv:2305.01598.
  • Kim et al. (2024) Hannah Kim, Kushan Mitra, Rafael Li Chen, Sajjadur Rahman, and Dan Zhang. 2024. Meganno+: A human-llm collaborative annotation system. arXiv preprint arXiv:2402.18050.
  • Li et al. (2023) Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. 2023. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032.
  • Ni et al. (2023) Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, and Xi Victoria Lin. 2023. Lever: learning to verify language-to-code generation with execution. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
  • Singh et al. (2023) Mukul Singh, José Cambronero Sánchez, Sumit Gulwani, Vu Le, Carina Negreanu, Mohammad Raza, and Gust Verbruggen. 2023. Cornet: Learning table formatting rules by example. Proceedings of the VLDB Endowment, 16(10):2632–2644.
  • Tan et al. (2024) Zhen Tan, Alimohammad Beigi, Song Wang, Ruocheng Guo, Amrita Bhattacharjee, Bohan Jiang, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. 2024. Large language models for data annotation: A survey. arXiv preprint arXiv:2402.13446.
  • Wang et al. (2024) Xinru Wang, Hannah Kim, Sajjadur Rahman, Kushan Mitra, and Zhengjie Miao. 2024. Human-llm collaborative annotation through effective verification of llm labels. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–21.
  • Yuksekgonul et al. (2023) Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, and Besmira Nushi. 2023. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. arXiv preprint arXiv:2309.15098.
  • Zhao et al. (2024) Wei Zhao, Zhitao Hou, Siyuan Wu, Yan Gao, Haoyu Dong, Yao Wan, Hongyu Zhang, Yulei Sui, and Haidong Zhang. 2024. Nl2formula: Generating spreadsheet formulas from natural language queries. arXiv preprint arXiv:2402.14853.
  • Zhou et al. (2024) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36.

8 Appendix

8.1 Training Data Characteristics

In this section, we summarise important formula properties for the training data extracted from excel workbooks (see Table 5). From the original corpus, we remove any formulas that have deprecated functions to produce a set of 10,389 (table, formula) pairs. We then remove any pairs where the formula results in a missing/empty value for all output rows or uses multiple tables. After the process of filtering, our final dataset consists of 7,833 (table, formula) pairs. This dataset has formulas which use 122 distinct built-in functions. The most popular functions match those typically employed by Excel spreadsheet users: IF, SUM, IFERROR, CONCATENATE, AND. The other properties are summarised in Table 5). The function call count refers to the frequency of Excel function calls within a formula. The depth of formulas denotes the extent of nested function calls within them. Operator count is the number of arithmetic operators (+, -, *, /) in a formula.

Table 5: Characteristics of formulas used in Training Data obtained from Excel spreadsheets
Fxn. call count Formula depth Op. count
0 3554 3554 2887
1 2625 2682 2811
2 965 1030 1169
3 285 325 435
4 115 125 187
5absent5\geq 5≥ 5 289 113 344

8.2 Model hyper-parameters used while Fine-tuning

Phi-2

For the Phi-2 model, fine-tuning was performed for 10 epochs with a batch size of 8. The learning rate was set to 1e-6, and the Adam optimizer was used along with a cross-entropy loss function.

Mistral

The Mistral model was fine-tuned for 15 epochs using the LoRA technique Hu et al. (2021). The specific parameters for LoRA included a LoRA rank (Lora_r𝐿𝑜𝑟𝑎_𝑟Lora\_ritalic_L italic_o italic_r italic_a _ italic_r) of 64, a LoRA alpha (Lora_alpha𝐿𝑜𝑟𝑎_𝑎𝑙𝑝𝑎Lora\_alphaitalic_L italic_o italic_r italic_a _ italic_a italic_l italic_p italic_h italic_a) of 16, and a LoRA dropout (Lora_dropout𝐿𝑜𝑟𝑎_𝑑𝑟𝑜𝑝𝑜𝑢𝑡Lora\_dropoutitalic_L italic_o italic_r italic_a _ italic_d italic_r italic_o italic_p italic_o italic_u italic_t) of 0.1. The target modules for LoRA adaptation were "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", and "lm_head". No bias configuration was used, and the task type was Causal Language Modeling (CAUSAL_LM). The learning rate for this model was set to 2e-4, and the batch size was 8. Optimization was carried out using the PagedAdamW 32-bit optimizer.