ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages

Mehant Kammakomati^∗ {srikanth.tamilselvam}@in.ibm.com Sameer Pimparkhede^∗ {sameerp,pb}@cse.iitb.ac.in Srikanth G. Tamilselvam {srikanth.tamilselvam}@in.ibm.com
Prince Kumar {srikanth.tamilselvam}@in.ibm.com Pushpak Bhattacharyya {sameerp,pb}@cse.iitb.ac.in

Abstract

Recent work shows that Large Language Models (LLMs) struggle to understand natural language constraints for various text generation tasks in zero- and few-shot settings. In the code domain, constraints expressed in code format are widely used to maintain the integrity of code written in Domain-Specific Languages (DSLs; https://w.wiki/6jCH), yet no prior work has evaluated LLMs on these constraints. We propose two novel tasks to assess the controllability of LLMs using hard and soft constraints represented as code across five representations. Our findings suggest that LLMs struggle to comprehend constraints in all representations, irrespective of their share of the pre-training data. While models are better at comprehending constraints in JSON, YAML, and natural language representations, they struggle with constraints represented in XML and in the resource-rich language Python.

∗ The first two authors contributed equally.

1 Introduction

Large language models (LLMs) have shown promising results (Brown et al., 2020) in comprehending and subsequently generating coherent text and code in zero- and few-shot settings, especially for resource-rich languages. However, their practical utility depends on their ability to follow constraints at various levels of granularity, encompassing user and system requirements. Recent work (Sun et al., 2023) shows that LLMs have difficulty understanding fine-grained hard constraints expressed in natural language in zero- and few-shot settings for tasks such as paraphrase generation and numerical planning, while performing reasonably well on coarse-grained constraints such as generating text on a given topic. In the code domain, encoding use-case-specific constraints in code format is prevalent and crucial for enterprises to maintain the integrity of code that is typically written in structured DSLs such as JSON, YAML, and XML. Such DSLs are widely adopted for data exchange and for configuring systems such as Kubernetes (https://w.wiki/3kbZ) and Ansible (https://w.wiki/8Qg4), and there is growing interest (Nair and Modani, 2023) in employing LLMs for DSLs. Depending on the use case and system, these often fine-grained constraints are articulated as schemas (we use "schemas" and "constraints in code format" interchangeably) in languages such as JSON Schema, YAML, or Python Pydantic to validate DSL code. A typical schema holds fine-grained hard and soft constraints such as data types, required and optional fields, default values, and numerical constraints. LLMs must be cognizant of such constraints to produce reliable code in practice. Therefore, we aim to study the controllability of LLMs when fine-grained constraints are represented in code format.

Listing 1: The JSON sample generated (highlighted in yellow) by the Granite 20B model does not adhere to the exclusiveMaximum and multipleOf constraints specified in the schema.

Write a JSON sample with field values as per the JSON format schema given below.
{
  "type": "array",
  "contains": {
    "type": "number",
    "multipleOf": 2.66,
    "exclusiveMinimum": 0.08231885995435284,
    "exclusiveMaximum": 5.1100233535478
  },
  "maxContains": 10
}
JSON sample:
```
[2.66, 5.22, 8.88, 11.54, 14.2]
```
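
The violations named in the caption of Listing 1 can be checked by hand. The following is a minimal sketch, assuming (as the caption does) that every array element is expected to satisfy the numeric constraints; the helper `violated_keywords` and the floating-point tolerance are illustrative choices, not part of the paper's tooling.

```python
import math

# Numeric constraints copied from the schema in Listing 1.
CONSTRAINTS = {
    "multipleOf": 2.66,
    "exclusiveMinimum": 0.08231885995435284,
    "exclusiveMaximum": 5.1100233535478,
}


def violated_keywords(value, constraints, tol=1e-9):
    """Return the constraint keywords that a single number violates."""
    violated = []
    quotient = value / constraints["multipleOf"]
    # A multiple of 2.66 yields a (near-)integer quotient.
    if not math.isclose(quotient, round(quotient), abs_tol=tol):
        violated.append("multipleOf")
    if value <= constraints["exclusiveMinimum"]:
        violated.append("exclusiveMinimum")
    if value >= constraints["exclusiveMaximum"]:
        violated.append("exclusiveMaximum")
    return violated


# The sample generated by the model in Listing 1.
sample = [2.66, 5.22, 8.88, 11.54, 14.2]
report = {value: violated_keywords(value, CONSTRAINTS) for value in sample}
for value, violated in report.items():
    print(value, violated or "ok")
```

Under this reading, only the first element passes; every other element both exceeds the exclusive maximum and fails to be a multiple of 2.66.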

To evaluate the extent to which LLMs can follow these constraints in code format, we introduce two novel tasks: (i) Data as Code generation, which requires generating a valid sample that respects the constraints, and (ii) DSL validation, which requires validating code against the constraints. We generate synthetic evaluation data with various combinations of constraints using combinatorial tools. This avoids data leakage and gives flexible control over the evaluated constraints. Our evaluation includes constraints represented as JSON schema, YAML, XML, Python Pydantic, and natural language, with samples generated and validated in JSON, YAML, and XML. We choose different input representations for the schema for two reasons: (i) the selected languages are widely used across enterprises for schema representation, and (ii) this lets us study the impact of different input representations on the model's performance under similar constraints.
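
As an illustration of combinatorial schema generation, the sketch below enumerates schemas from small pools of options. The pools (`TYPE_OPTIONS`, `NUMERIC_OPTIONS`, `REQUIRED_OPTIONS`) and the function `build_schemas` are hypothetical; the paper does not describe its combinatorial tool at this level of detail.

```python
import itertools
import json

# Hypothetical option pools; the paper does not list its exact constraint pools.
TYPE_OPTIONS = ["number", "string", "boolean"]
NUMERIC_OPTIONS = [{}, {"multipleOf": 2}, {"minimum": 0, "maximum": 10}]
REQUIRED_OPTIONS = [True, False]


def build_schemas(field_name="value"):
    """Enumerate single-field object schemas over all compatible option combinations."""
    schemas = []
    for type_name, numeric, required in itertools.product(
        TYPE_OPTIONS, NUMERIC_OPTIONS, REQUIRED_OPTIONS
    ):
        if numeric and type_name != "number":
            continue  # numeric keywords are only paired with number fields
        field = {"type": type_name, **numeric}
        schemas.append(
            {
                "type": "object",
                "properties": {field_name: field},
                "required": [field_name] if required else [],
            }
        )
    return schemas


schemas = build_schemas()
print(len(schemas))
print(json.dumps(schemas[0], indent=2))
```

Because every schema is assembled from known parts, the set of constraints present in each instance is known by construction, which is what makes per-constraint analysis possible.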

To the best of our knowledge, our work is the first to study the controllability of LLMs for constraints represented as code. Our findings suggest that for Task 1, LLMs find it challenging to follow schema instructions, primarily for Python and XML schema formats. For Task 2, we observe that all the LLMs find it difficult to validate a sample against a code schema, especially when the sample does not adhere to the schema.

Model	Schema	JSON SV	JSON IS (%)	JSON RTV (%)	YAML SV	YAML IS (%)	YAML RTV (%)	XML SV	XML IS (%)	XML RTV (%)
Llama3 8B	JSON	28.2	1.9	50.1	29.2	1.8	49.8	7.9	1.6	73.9
Granite 8B		47.5	2.9	31.0	24.7	2.8	57.3	5.1	17.1	70.26
Granite 20B		50.4	13.9	15.6	37.7	2.3	38.0	10.1	7.9	71.92
Granite 34B		53.3	2.6	23.5	32.2	2.6	48.6	11.2	4.1	73.08
Codellama 34B		58.4	3.6	17.9	23.0	1.8	51.4	9.4	3.7	71.12
Llama3 8B	XML	10.2	12.9	64.1	22.5	6.1	52.8	10.2	4.8	73.5
Granite 8B		18.9	3.6	60.7	12.1	2.8	70.9	8.4	10.7	72.0
Granite 20B		24.0	2.1	53.3	12.4	1.9	73.9	8.6	12.2	70.5
Granite 34B		18.7	1.9	56.9	18.1	1.6	63.1	8.6	10.6	71.9
Codellama 34B		8.8	2.3	71.2	14.2	1.6	56.9	8.6	10.2	71.7
Llama3 8B	YAML	25.9	1.3	53.3	8.1	3.1	62.4	6.4	0.4	74.5
Granite 8B		47.0	11.2	13.7	15.7	1.8	63.9	8.6	12.2	70.5
Granite 20B		34.7	1.6	39.8	25.9	1.4	56.6	8.4	10.7	72.0
Granite 34B		52.1	3.1	14.9	26.4	1.1	40.6	8.6	10.6	71.9
Codellama 34B		48.0	7.1	24.9	27.9	1.4	50.3	9.1	12.6	71.0
Llama3 8B	Python	13.7	5.4	64.9	10.2	3.1	72.9	11.6	3.1	72.9
Granite 8B		10.2	2.4	73.0	11.9	2.3	70.9	11.1	10.7	72.71
Granite 20B		14.6	1.6	64.7	11.7	2.4	68.7	7.3	16.6	71.42
Granite 34B		17.7	2.6	61.2	13.9	2.4	66.9	10.6	8.9	69.35
Codellama 34B		13.7	5.6	65.1	11.6	2.9	64.1	8.4	14.1	69.1
Llama3 8B	NL	30.2	5.8	50.4	24.5	3.4	54.1	9.6	5.6	73.9
Granite 8B		52.3	2.1	28.9	42.1	2.6	29.2	11.1	8.3	69.24
Granite 20B		65.4	2.9	0.6	46.0	2.8	30.2	10.9	7.97	69.24
Granite 34B		69.7	2.3	1.9	55.1	2.4	8.9	10.9	9.86	63.42
Codellama 34B		60.4	2.8	60.4	40.6	2.9	34.5	8.69	7.88	65.51

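Table 1: Task 1 zero-shot results. Columns are grouped by output representation (JSON, YAML, XML). SV denotes the percentage of perfect samples, IS the percentage of invalid samples, and RTV the percentage of samples with root data type errors. For IS and RTV, lower is better. The schema representation is given on the first row of each group of models and applies to the rows below it.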

2 Data as Code Generation in DSL

Task description:

Given the schema, the generation task (see Listing 1) aims to produce a compliant data sample in code format in a DSL of interest. We draw inspiration from use cases such as synthesizing schema-compliant data from LLMs to train and evaluate smaller models (Song et al., 2020) and generating diverse sets of samples for use in product test pipelines. Since data represented in a DSL is structured, LLMs need to be schema-aware during generation.

Dataset:

We synthetically prepare 602 schemas across 5 representations, with combinations of hard and soft constraints. First, we prepare JSON schemas using our combinatorial tool to generate a good mix of constraints. We then convert each JSON schema to XML and YAML schemas using automated tools to ensure equivalence across representations. Further, we include a Python representation using the Pydantic library, as a resource-rich general-purpose language; these schemas are generated with the Gemini-1.0-pro (Team et al., 2024) model as a code translation task. We extend our evaluation to a natural language representation generated using rule-based templates over the JSON schema. We ensure equivalence of the generated schemas across languages by manually inspecting the samples (the generated Python samples are manually validated by the paper's authors).
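
The rule-based conversion from a JSON schema to natural language can be sketched as follows. The phrase templates in `TEMPLATES` and the functions `describe_field` and `describe_schema` are assumptions made for illustration; the paper's actual templates are not shown in this section.

```python
# Hypothetical phrase templates keyed by JSON Schema keyword;
# the paper's actual templates are not shown.
TEMPLATES = {
    "multipleOf": "must be a multiple of {}",
    "minimum": "must be at least {}",
    "maximum": "must be at most {}",
    "exclusiveMinimum": "must be greater than {}",
    "exclusiveMaximum": "must be less than {}",
}


def describe_field(name, spec, required):
    """Render one property of an object schema as an English sentence."""
    parts = [f"Field '{name}' is {'required' if required else 'optional'}"]
    if "type" in spec:
        parts.append(f"has type {spec['type']}")
    for keyword, template in TEMPLATES.items():
        if keyword in spec:
            parts.append(template.format(spec[keyword]))
    return ", ".join(parts) + "."


def describe_schema(schema):
    """Render every property of an object schema, one sentence per line."""
    required = set(schema.get("required", []))
    return "\n".join(
        describe_field(name, spec, name in required)
        for name, spec in schema.get("properties", {}).items()
    )


schema = {
    "type": "object",
    "properties": {
        "anisic": {"type": "number", "multipleOf": 17.02},
        "tamil": {"type": "boolean"},
    },
    "required": ["tamil"],
}
text = describe_schema(schema)
print(text)
```

A template-per-keyword design keeps the natural language description mechanically tied to the JSON schema, so both representations carry the same constraints.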

Evaluation metric:

Each schema-compliant code sample that an LLM generates is awarded one point when it passes validation with a schema validator tool. We then use accuracy over all samples to benchmark performance across models. Along with accuracy, we also report the percentage of invalid data-type samples (IS%) and the percentage of samples generated with an invalid root data type (RTV%). The root data type is the data type of the whole JSON sample. For example, the root data type of the sample in Listing 1 is array. For the IS and RTV metrics, lower is better.
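
One way the three Task 1 metrics could be computed is sketched below. Two points are assumptions of this sketch rather than statements from the paper: outputs that fail to parse are counted under IS, and the callable `is_valid` stands in for the unnamed schema validator tool. Samples with a correct root type that still violate a constraint fall into none of the three buckets, so the percentages need not sum to 100.

```python
import json

# JSON Schema type names mapped to the Python types that json.loads produces.
ROOT_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "null": type(None),
}


def root_type_matches(sample, schema):
    """Check the root data type of a parsed sample against the schema's `type`."""
    expected = schema["type"]
    if expected == "number" and isinstance(sample, bool):
        return False  # bool is a subclass of int in Python, but not a JSON number
    return isinstance(sample, ROOT_TYPES[expected])


def score_outputs(outputs, schemas, is_valid):
    """Return SV, IS and RTV as percentages over all model outputs."""
    perfect = unparseable = root_errors = 0
    for raw, schema in zip(outputs, schemas):
        try:
            sample = json.loads(raw)
        except json.JSONDecodeError:
            unparseable += 1  # assumption: unparseable output counts as IS
            continue
        if not root_type_matches(sample, schema):
            root_errors += 1
        elif is_valid(sample, schema):
            perfect += 1
    total = len(outputs)
    return {
        "SV": 100.0 * perfect / total,
        "IS": 100.0 * unparseable / total,
        "RTV": 100.0 * root_errors / total,
    }


# Toy usage with a stand-in validator that only checks the maxItems keyword.
schema = {"type": "array", "maxItems": 2}
outputs = ["[1, 2]", "[1, 2, 3]", '{"a": 1}', "[1, 2"]
scores = score_outputs(
    outputs, [schema] * len(outputs), lambda s, sc: len(s) <= sc["maxItems"]
)
print(scores)
```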

Experimental setup:

We experiment with greedy decoding and beam search decoding using a beam width of 3. Both decoding techniques often perform similarly, yet greedy decoding consistently shows a slight edge; therefore, we present results using greedy decoding.

Prompts:

We experiment with zero-shot and 3-shot prompting for each model. For 3-shot prompting, we identify errors from the zero-shot setting, then select and use samples similar to the most frequent errors. Using entirely different schema representations as few-shot samples does not improve model performance. Examples of prompts can be found in Appendix 1.
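
A zero-shot or 3-shot prompt in the style of Listing 1 could be assembled as sketched below. The instruction wording follows Listing 1; the way demonstrations are concatenated, the helper names, and the three demonstration schemas are assumptions, since the exact few-shot layout is given only in the appendix.

```python
import json

FENCE = "`" * 3  # code fence that opens the model's answer in Listing 1
_NO_SAMPLE = object()  # sentinel, so that a JSON null (None) can be a demonstration answer


def format_block(schema, sample=_NO_SAMPLE):
    """Format one prompt block; the answer is filled in only for demonstrations."""
    lines = [
        "Write a JSON sample with field values as per the JSON format schema given below.",
        json.dumps(schema, indent=2),
        "JSON sample:",
        FENCE,
    ]
    if sample is not _NO_SAMPLE:
        lines += [json.dumps(sample), FENCE]
    return "\n".join(lines)


def build_prompt(schema, demonstrations=()):
    """Zero-shot when `demonstrations` is empty, k-shot otherwise."""
    blocks = [format_block(demo_schema, demo_sample) for demo_schema, demo_sample in demonstrations]
    blocks.append(format_block(schema))
    return "\n\n".join(blocks)


target = {"type": "array", "contains": {"type": "number", "multipleOf": 2.66}}
# Hypothetical demonstrations; the paper selects ones resembling frequent zero-shot errors.
demos = [
    ({"type": "boolean"}, True),
    ({"type": "null"}, None),
    ({"type": "number", "minimum": 3}, 4),
]
zero_shot = build_prompt(target)
three_shot = build_prompt(target, demos)
print(three_shot)
```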

Results:

We observe that among the five schema representations studied, natural language is best understood by the models for all output representations. For constraints in code format, JSON and YAML schemas perform best owing to their wide usage for this purpose, even though they are not among the major languages in the pre-training data. Conversely, although Python makes up a major portion of the code pre-training data, models struggle to understand constraints represented in Python and consequently fail to generate valid data samples across all output languages. Similarly, models struggle to comprehend XML schemas, and to generate XML output across all schema representations. Among the Codellama and Granite families, we observe that Granite 20B and 34B are the best-performing models for all language representations except YAML, where the Granite 8B model performs best. The Codellama 34B model consistently gives subpar performance for code representations, even when compared to the smaller Granite 8B model. Thus, we conclude that, irrespective of their minor share of the code pre-training data, models perform best for JSON and natural language schema representations.

Model	Schema	JSON Macro-F1	JSON Accuracy	YAML Macro-F1	YAML Accuracy	XML Macro-F1	XML Accuracy
Llama3 8B	JSON	0.55	0.56	0.37	0.45	0.40	0.47
Granite 8B		0.55	0.56	0.55	0.55	0.42	0.45
Granite 20B		0.48	0.52	0.37	0.44	0.47	0.53
Granite 34B		0.60	0.64	0.56	0.57	0.63	0.65
Codellama 34B		0.64	0.64	0.53	0.54	0.50	0.53
Llama3 8B	XML	0.44	0.37	0.35	0.42	0.41	0.46
Granite 8B		0.45	0.47	0.44	0.44	0.50	0.52
Granite 20B		0.24	0.37	0.45	0.47	0.56	0.57
Granite 34B		0.52	0.68	0.47	0.58	0.39	0.58
Codellama 34B		0.41	0.46	0.41	0.46	0.48	0.50
Llama3 8B	YAML	0.38	0.46	0.40	0.44	0.40	0.45
Granite 8B		0.45	0.47	0.50	0.50	0.44	0.44
Granite 20B		0.24	0.31	0.31	0.38	0.45	0.47
Granite 34B		0.52	0.68	0.55	0.61	0.47	0.58
Codellama 34B		0.59	0.59	0.52	0.53	0.58	0.58
Llama3 8B	Python	0.37	0.43	0.36	0.42	0.38	0.43
Granite 8B		0.54	0.54	0.44	0.58	0.54	0.55
Granite 20B		0.34	0.45	0.45	0.67	0.36	0.44
Granite 34B		0.53	0.54	0.47	0.67	0.40	0.46
Codellama 34B		0.48	0.49	0.45	0.53	0.46	0.44
Llama3 8B	NL	0.63	0.63	0.55	0.56	0.57	0.57
Granite 8B		0.45	0.59	0.51	0.61	0.39	0.58
Granite 20B		0.53	0.54	0.45	0.48	0.57	0.60
Granite 34B		0.45	0.55	0.46	0.46	0.38	0.56
Codellama 34B		0.52	0.57	0.54	0.57	0.42	0.50

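Table 2: Task 2 zero-shot results. Columns are grouped by output representation (JSON, YAML, XML). Task 2 is a binary classification task; hence, we measure performance using the Macro-F1 score and Accuracy. The schema representation is given on the first row of each group of models and applies to the rows below it.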

3 DSL Validation

Task description:

There is a growing body of work (Chiang and Lee, 2023; Hada et al., 2024) showing promising use of LLMs as evaluators, as an alternative to human evaluators, in many tasks. While LLMs may not be deployed to validate DSL samples when schemas are in machine-readable formats, constraints are often articulated in natural language, where automated validation is infeasible. Further, the ability to validate against a given schema sheds light on LLMs' understanding of the relation between requirements and output. Given a DSL sample and a schema, the task is to decide whether the sample is valid with respect to the constraints. Through this boolean question task (see Listing 2), we study the following questions: (i) How do LLMs perform on schemas of varying lengths? (ii) How well do LLMs perform with a schema in natural language compared to its code counterparts under similar constraints?

Listing 2: In the JSON sample, the values for fields stingo and anisic do not adhere to schema constraints. But the Granite 34B model gives the incorrect answer (highlighted in yellow) as yes.

Question:
Does the JSON sample {"tamil": false, "baser": null, "paltriness": "congue.", "anisic": 1906.34, "stingo": "officiis tellus. illum modi odit quas mattis nunc", "pigheadedness": 52.0} adhere to all the constraints defined in JSON format schema
{
  "type": "object",
  "properties": {
    "tamil": {"type": "boolean"},
    "baser": {"type": "null"},
    "paltriness": {},
    "anisic": {"type": "number", "multipleOf": 17.02},
    "stingo": {"type": "string", "maxlen": 20},
    "pigheadedness": {"type": "number",
      "exclusiveMinimum": 27.65410407394338,
      "maximum": 93.85523810367313
    }
  },
  "additionalProperties": false,
  "required": []
}
Respond to yes or no.
Answer:
```
yes
```

Dataset:

For each of the 602 schemas across 5 representations described in Section 2, we generate 3 data samples across the JSON, XML, and YAML languages. First, these data samples are synthetically generated by parsing the JSON schema and randomly pruning and selecting constraints, resulting in data samples of different lengths and constraints. We then convert the generated JSON data samples to equivalent YAML and XML formats. The Task 2 dataset consists of 3076 instances, of which 45% are no instances and 55% are yes instances.
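
The conversion of a JSON data sample to an equivalent XML sample can be illustrated with the standard library. This is a minimal sketch under several assumptions: the element names (`root`, `item`), the `nil` attribute used for JSON null, and the function `to_xml` are illustrative choices, and the paper does not name the conversion tool it uses.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(value, tag="root"):
    """Recursively convert a parsed JSON value into an XML element.

    Keys are used directly as tag names, so they are assumed to be valid XML names.
    """
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            element.append(to_xml(child, key))
    elif isinstance(value, list):
        for child in value:
            element.append(to_xml(child, "item"))
    elif value is None:
        element.set("nil", "true")  # assumed encoding for JSON null
    elif isinstance(value, bool):
        element.text = "true" if value else "false"
    else:
        element.text = str(value)
    return element


# A subset of the fields from the sample in Listing 2.
sample = json.loads('{"tamil": false, "baser": null, "pigheadedness": 52.0}')
xml_text = ET.tostring(to_xml(sample), encoding="unicode")
print(xml_text)
```

Note that XML has no native data types, so the distinction between a number and a numeric string is lost unless the schema supplies it; this is one reason equivalence across output formats needs explicit checking.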

Evaluation metric:

Since this is a boolean question answering task, we use the macro-averaged F1 score and accuracy as evaluation metrics.
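
For reference, both metrics can be computed from gold and predicted answers as in the sketch below. The function names are illustrative, and the four-instance example is made up to show the arithmetic; it is not data from the paper.

```python
def f1_for_label(gold, predicted, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, labels=("yes", "no")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_label(gold, predicted, label) for label in labels) / len(labels)


def accuracy(gold, predicted):
    """Fraction of instances answered correctly."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Made-up example: one "yes" instance is misclassified as "no".
gold = ["yes", "yes", "no", "no"]
predicted = ["yes", "no", "no", "no"]
print(macro_f1(gold, predicted), accuracy(gold, predicted))
```

Macro averaging weights the yes and no classes equally, so a model that answers yes for every instance scores poorly on Macro-F1 even when its accuracy matches the majority-class share.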

Experimental setup:

The decoding strategy used here is similar to that of the generation task, as specified in Section 2.

Prompts:

The goal of this task is to answer with either yes or no. We experiment with zero-shot and few-shot prompting. In few-shot prompting, we provide two examples, one with yes as the answer and the other with no.

Results:

Although the natural language representation performs best for the generation task, it lowers the performance of larger models such as Granite 34B and Codellama 34B on the validation task. The JSON representation works best for all three output languages for validation, and the Python and XML formats are the most difficult for all models to understand. Among the evaluated models, the 34B models give the best validation performance. Most of the other models reach an accuracy of around 0.5, which is what random guessing between the two options, yes and no, would achieve. Surprisingly, the Granite 20B model, which performs best on the generation task, underperforms the other models in most cases. We observe a significant improvement for the small 8B models, especially Llama3 8B, with the natural language representation, suggesting a heavy presence of natural language instances in their pre-training data. Finally, we observe that the Granite and Llama family models perform similarly on the validation task, unlike on code generation. The fact that models perform at or below a random-guess baseline for the non-JSON representations suggests large scope for improvement on this task.
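
To make the random-guess comparison concrete, the sketch below works out two trivial baselines from the 55% yes and 45% no split reported for the Task 2 dataset. These baselines are not reported in the paper; the Macro-F1 for random guessing is computed from expected precision and recall, which approximates rather than equals the expected Macro-F1.

```python
P_YES = 0.55  # share of "yes" instances reported for the Task 2 dataset
P_NO = 0.45   # share of "no" instances


def f1(precision, recall):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline 1: always answer "yes".
# Recall on "yes" is 1 and precision equals the "yes" share; "no" is never predicted.
always_yes_accuracy = P_YES
always_yes_macro_f1 = (f1(P_YES, 1.0) + 0.0) / 2

# Baseline 2: answer "yes" or "no" with probability 0.5 each, ignoring the input.
# Expected precision on each class equals its share; expected recall is 0.5.
random_accuracy = 0.5 * P_YES + 0.5 * P_NO
random_macro_f1 = (f1(P_YES, 0.5) + f1(P_NO, 0.5)) / 2

print(f"always yes: accuracy={always_yes_accuracy:.3f}, macro-F1={always_yes_macro_f1:.3f}")
print(f"random:     accuracy={random_accuracy:.3f}, macro-F1={random_macro_f1:.3f}")
```

Against these reference points, an accuracy near 0.5 together with a Macro-F1 well below 0.5 indicates predictions skewed toward one class rather than informed validation.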

4 Related Work

Generation:

There is extensive work (Chen et al., 2021; Muennighoff et al., 2024; Austin et al., 2021; Lai et al., 2022; Cassano et al., 2022; Hendrycks et al., 2021; Yin et al., 2018) on evaluating the code capabilities of LLMs for various code tasks such as code completion, translation, and fixing. These code benchmarks are often language-scoped, and resource-rich languages such as Python are widely evaluated. Although there is work (Cassano et al., 2022) on multi-lingual code, scant attention has been paid to resource-poor languages such as DSLs, including YAML and JSON, despite their crucial importance.

Validation:

Given the limited availability of benchmarks for low-resource languages, coupled with the promising performance demonstrated by LLMs (Wei et al., 2023; Sanh et al., 2022) on unseen tasks when provided with task-specific instructions, there is an emerging body of work (Hada et al., 2024; Chiang and Lee, 2023) studying LLMs as evaluators for low-resource languages. However, owing to prompt sensitivity, bias, and hallucinations, LLMs are not yet a drop-in replacement for human evaluators. There is some interest (Lian et al., 2024) in using LLMs as evaluators for low-resource languages such as DSLs, but it is limited to the XML and INI languages.

Controllability of LLMs:

Though LLMs are controllable on coarse-grained natural language constraints, such as generating text coherent with a provided topic or sentiment, they struggle (Sun et al., 2023) with fine-grained constraints such as ending the text with a given word or generating text within a given word count. Code schemas often encompass fine-grained constraints, and to the best of our knowledge, we are the first to study the controllability of LLMs for constraints represented as code.

5 Conclusion

We introduce two novel tasks to test the controllability of LLMs when constraints are given in code format, which is prevalent in the use of DSLs for data exchange and system configuration. We evaluate LLMs over 5 schema representations (YAML, JSON, Python, XML, and natural language) and 3 output representations (YAML, JSON, and XML). From both tasks, we conclude that there is substantial scope for improvement in how LLMs comprehend constraints in code format in order to generate a compliant sample or validate a given sample. Further, our findings show that LLM performance is not directly correlated with the share of a language in the pre-training data: LLMs struggled to comprehend constraints represented in the resource-rich language Python, yet showed better performance for JSON and natural language, which are minor portions of the code pre-training data. We hope that our work provides some guidance on when to employ LLMs given the languages of interest, and encourages research on improving the controllability of LLMs over code constraints.

Limitations

Ethics Statement

References

Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program synthesis with large language models. Preprint, arXiv:2108.07732.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Cassano et al. (2022) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. 2022. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. Preprint, arXiv:2208.08227.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.
Chiang and Lee (2023) Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
Hada et al. (2024) Rishav Hada, Varun Gumma, Adrian Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2024. Are large language model-based evaluators the solution to scaling up multilingual evaluation? In Findings of the Association for Computational Linguistics: EACL 2024, pages 1051–1070, St. Julian’s, Malta. Association for Computational Linguistics.
Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. Preprint, arXiv:2105.09938.
Lai et al. (2022) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2022. DS-1000: A natural and reliable benchmark for data science code generation. Preprint, arXiv:2211.11501.
Lian et al. (2024) Xinyu Lian, Yinfang Chen, Runxiang Cheng, Jie Huang, Parth Thakkar, Minjia Zhang, and Tianyin Xu. 2024. Configuration validation with large language models. Preprint, arXiv:2310.09690.
Muennighoff et al. (2024) Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. 2024. OctoPack: Instruction tuning code large language models. Preprint, arXiv:2308.07124.
Nair and Modani (2023) Inderjeet Nair and Natwar Modani. 2023. Exploiting language characteristics for legal domain-specific language model pretraining. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2516–2526, Dubrovnik, Croatia. Association for Computational Linguistics.
Sanh et al. (2022) Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. 2022. Multitask prompted training enables zero-shot task generalization. Preprint, arXiv:2110.08207.
Song et al. (2020) Kaitao Song, Hao Sun, Xu Tan, Tao Qin, Jianfeng Lu, Hongzhi Liu, and Tie-Yan Liu. 2020. LightPAFF: A two-stage distillation framework for pre-training and fine-tuning. Preprint, arXiv:2004.12817.
Sun et al. (2023) Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, and Xuezhe Ma. 2023. Evaluating large language models on controlled generation tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3155–3168, Singapore. Association for Computational Linguistics.
Team et al. (2024) Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, Jack Krawczyk, Cosmo Du, Ed Chi, Heng-Tze Cheng, Eric Ni, Purvi Shah, Patrick Kane, Betty Chan, Manaal Faruqui, Aliaksei Severyn, Hanzhao Lin, YaGuang Li, Yong Cheng, Abe Ittycheriah, Mahdis Mahdieh, Mia Chen, Pei Sun, Dustin Tran, Sumit Bagri, Balaji Lakshminarayanan, Jeremiah Liu, Andras Orban, Fabian Güra, Hao Zhou, Xinying Song, Aurelien Boffy, Harish Ganapathy, Steven Zheng, HyunJeong Choe, Ágoston Weisz, Tao Zhu, Yifeng Lu, Siddharth Gopal, Jarrod Kahn, Maciej Kula, Jeff Pitman, Rushin Shah, Emanuel Taropa, Majd Al Merey, Martin Baeuml, Zhifeng Chen, Laurent El Shafey, Yujing Zhang, Olcan Sercinoglu, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura Culp, Lev Proleev, Yi Luan, Xi Chen, James Lottes, Nathan Schucher, Federico Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao, Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, Jack W. 
Rae, Han Lu, Laurent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shantanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi, Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, Gaurav Singh Tomar, Evan Senter, Martin Chadwick, Ilya Kornakov, Nithya Attaluri, Iñaki Iturrate, Ruibo Liu, Yunxuan Li, Sarah Cogan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang, Jordan Grimstad, Ale Jakse Hartman, Xavier Garcia, Thanumalayan Sankaranarayana Pillai, Jacob Devlin, Michael Laskin, Diego de Las Casas, Dasha Valter, Connie Tao, Lorenzo Blanco, Adrià Puigdomènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Winkler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Ravi Addanki, Antoine Miech, Annie Louis, Denis Teplyashin, Geoff Brown, Elliot Catt, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ashwood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu, Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei, Bogdan Damoc, Alex Kaskasoli, Sébastien M. R. 
Arnold, Vijay Vasudevan, Shubham Agrawal, Jason Riesa, Dmitry Lepikhin, Richard Tanburn, Srivatsan Srinivasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand, Ankush Garg, Tom Le Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Zaheer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Dipanjan Das, Dominika Rogozińska, Vitaliy Nikolaev, Pablo Sprechmann, Zachary Nado, Lukas Zilka, Flavien Prost, Luheng He, Marianne Monteiro, Gaurav Mishra, Chris Welty, Josh Newlan, Dawei Jia, Miltiadis Allamanis, Clara Huiyi Hu, Raoul de Liedekerke, Justin Gilmer, Carl Saroufim, Shruti Rijhwani, Shaobo Hou, Disha Shrivastava, Anirudh Baddepudi, Alex Goldin, Adnan Ozturel, Albin Cassirer, Yunhan Xu, Daniel Sohn, Devendra Sachan, Reinald Kim Amplayo, Craig Swanson, Dessie Petrova, Shashi Narayan, Arthur Guez, Siddhartha Brahma, Jessica Landon, Miteyan Patel, Ruizhe Zhao, Kevin Villela, Luyu Wang, Wenhao Jia, Matthew Rahtz, Mai Giménez, Legg Yeung, James Keeling, Petko Georgiev, Diana Mincu, Boxi Wu, Salem Haykal, Rachel Saputro, Kiran Vodrahalli, James Qin, Zeynep Cankara, Abhanshu Sharma, Nick Fernando, Will Hawkins, Behnam Neyshabur, Solomon Kim, Adrian Hutter, Priyanka Agrawal, Alex Castro-Ros, George van den Driessche, Tao Wang, Fan Yang, Shuo yiin Chang, Paul Komarek, Ross McIlroy, Mario Lučić, Guodong Zhang, Wael Farhan, Michael Sharman, Paul Natsev, Paul Michel, Yamini Bansal, Siyuan Qiao, Kris Cao, Siamak Shakeri, Christina Butterfield, Justin Chung, Paul Kishan Rubenstein, Shivani Agrawal, Arthur Mensch, Kedar Soparkar, Karel Lenc, Timothy Chung, Aedan Pope, Loren Maggiore, Jackie Kay, Priya Jhakra, Shibo Wang, Joshua Maynez, Mary Phuong, Taylor Tobin, Andrea Tacchetti, Maja Trebacz, Kevin Robinson, Yash Katariya, Sebastian Riedel, Paige Bailey, Kefan Xiao, Nimesh Ghelani, Lora Aroyo, Ambrose Slone, Neil Houlsby, Xuehan Xiong, Zhen Yang, Elena Gribovskaya, Jonas Adler, Mateo Wirth, Lisa Lee, Music Li, Thais Kagohara, Jay 
Pavagadhi, Sophie Bridgers, Anna Bortsova, Sanjay Ghemawat, Zafarali Ahmed, Tianqi Liu, Richard Powell, Vijay Bolina, Mariko Iinuma, Polina Zablotskaia, James Besley, Da-Woon Chung, Timothy Dozat, Ramona Comanescu, Xiance Si, Jeremy Greer, Guolong Su, Martin Polacek, Raphaël Lopez Kaufman, Simon Tokumine, Hexiang Hu, Elena Buchatskaya, Yingjie Miao, Mohamed Elhawaty, Aditya Siddhant, Nenad Tomasev, Jinwei Xing, Christina Greer, Helen Miller, Shereen Ashraf, Aurko Roy, Zizhao Zhang, Ada Ma, Angelos Filos, Milos Besta, Rory Blevins, Ted Klimenko, Chih-Kuan Yeh, Soravit Changpinyo, Jiaqi Mu, Oscar Chang, Mantas Pajarskas, Carrie Muir, Vered Cohen, Charline Le Lan, Krishna Haridasan, Amit Marathe, Steven Hansen, Sholto Douglas, Rajkumar Samuel, Mingqiu Wang, Sophia Austin, Chang Lan, Jiepu Jiang, Justin Chiu, Jaime Alonso Lorenzo, Lars Lowe Sjösund, Sébastien Cevey, Zach Gleicher, Thi Avrahami, Anudhyan Boral, Hansa Srinivasan, Vittorio Selo, Rhys May, Konstantinos Aisopos, Léonard Hussenot, Livio Baldini Soares, Kate Baumli, Michael B. 
Chang, Adrià Recasens, Ben Caine, Alexander Pritzel, Filip Pavetic, Fabio Pardo, Anita Gergely, Justin Frye, Vinay Ramasesh, Dan Horgan, Kartikeya Badola, Nora Kassner, Subhrajit Roy, Ethan Dyer, Víctor Campos Campos, Alex Tomala, Yunhao Tang, Dalia El Badawy, Elspeth White, Basil Mustafa, Oran Lang, Abhishek Jindal, Sharad Vikram, Zhitao Gong, Sergi Caelles, Ross Hemsley, Gregory Thornton, Fangxiaoyu Feng, Wojciech Stokowiec, Ce Zheng, Phoebe Thacker, Çağlar Ünlü, Zhishuai Zhang, Mohammad Saleh, James Svensson, Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina Khan, Lisa Anne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr, Tara Sainath, Maribeth Rauh, Sayed Hadi Hashemi, Richard Ives, Yana Hasson, Eric Noland, Yuan Cao, Nathan Byrd, Le Hou, Qingze Wang, Thibault Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer Hassan, Kaushik Shivakumar, Joost van Amersfoort, Amol Mandhane, Pratik Joshi, Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra, Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Junhyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, Nicola De Cao, Charlie Chen, Sidharth Mudgal, Romina Stella, Kevin Brooks, Gautam Vasudevan, Chenxi Liu, Mainak Chain, Nivedita Melinkeri, Aaron Cohen, Venus Wang, Kristie Seymore, Sergey Zubkov, Rahul Goel, Summer Yue, Sai Krishnakumaran, Brian Albert, Nate Hurley, Motoki Sano, Anhad Mohananey, Jonah Joughin, Egor Filonov, Tomasz Kępa, Yomna Eldawy, Jiawern Lim, Rahul Rishi, Shirin Badiezadegan, Taylor Bos, Jerry Chang, Sanil Jain, Sri Gayatri Sundara Padmanabhan, Subha Puttagunta, Kalpesh Krishna, Leslie Baker, Norbert Kalb, Vamsi Bedapudi, Adam Kurzrok, Shuntong Lei, Anthony Yu, Oren Litvin, Xiang Zhou, Zhichun Wu, Sam Sobell, Andrea Siciliano, Alan 
Papir, Robby Neale, Jonas Bragagnolo, Tej Toor, Tina Chen, Valentin Anklin, Feiran Wang, Richie Feng, Milad Gholami, Kevin Ling, Lijuan Liu, Jules Walter, Hamid Moghaddam, Arun Kishore, Jakub Adamek, Tyler Mercado, Jonathan Mallinson, Siddhinita Wandekar, Stephen Cagle, Eran Ofek, Guillermo Garrido, Clemens Lombriser, Maksim Mukha, Botu Sun, Hafeezul Rahman Mohammad, Josip Matak, Yadi Qian, Vikas Peswani, Pawel Janus, Quan Yuan, Leif Schelin, Oana David, Ankur Garg, Yifan He, Oleksii Duzhyi, Anton Älgmyr, Timothée Lottaz, Qi Li, Vikas Yadav, Luyao Xu, Alex Chinien, Rakesh Shivanna, Aleksandr Chuklin, Josie Li, Carrie Spadine, Travis Wolfe, Kareem Mohamed, Subhabrata Das, Zihang Dai, Kyle He, Daniel von Dincklage, Shyam Upadhyay, Akanksha Maurya, Luyan Chi, Sebastian Krause, Khalid Salama, Pam G Rabinovitch, Pavan Kumar Reddy M, Aarush Selvan, Mikhail Dektiarev, Golnaz Ghiasi, Erdem Guven, Himanshu Gupta, Boyi Liu, Deepak Sharma, Idan Heimlich Shtacher, Shachi Paul, Oscar Akerlund, François-Xavier Aubet, Terry Huang, Chen Zhu, Eric Zhu, Elico Teixeira, Matthew Fritze, Francesco Bertolini, Liana-Eleonora Marinescu, Martin Bölle, Dominik Paulus, Khyatti Gupta, Tejasi Latkar, Max Chang, Jason Sanders, Roopa Wilson, Xuewei Wu, Yi-Xuan Tan, Lam Nguyen Thiet, Tulsee Doshi, Sid Lall, Swaroop Mishra, Wanming Chen, Thang Luong, Seth Benjamin, Jasmine Lee, Ewa Andrejczuk, Dominik Rabiej, Vipul Ranjan, Krzysztof Styrc, Pengcheng Yin, Jon Simon, Malcolm Rose Harriott, Mudit Bansal, Alexei Robsky, Geoff Bacon, David Greene, Daniil Mirylenka, Chen Zhou, Obaid Sarvana, Abhimanyu Goyal, Samuel Andermatt, Patrick Siegler, Ben Horn, Assaf Israel, Francesco Pongetti, Chih-Wei "Louis" Chen, Marco Selvatici, Pedro Silva, Kathie Wang, Jackson Tolins, Kelvin Guu, Roey Yogev, Xiaochen Cai, Alessandro Agostini, Maulik Shah, Hung Nguyen, Noah Ó Donnaile, Sébastien Pereira, Linda Friso, Adam Stambler, Adam Kurzrok, Chenkai Kuang, Yan Romanikhin, Mark Geller, ZJ Yan, Kane Jang, Cheng-Chun Lee, 
Wojciech Fica, Eric Malmi, Qijun Tan, Dan Banica, Daniel Balle, Ryan Pham, Yanping Huang, Diana Avram, Hongzhi Shi, Jasjot Singh, Chris Hidey, Niharika Ahuja, Pranab Saxena, Dan Dooley, Srividya Pranavi Potharaju, Eileen O’Neill, Anand Gokulchandran, Ryan Foley, Kai Zhao, Mike Dusenberry, Yuan Liu, Pulkit Mehta, Ragha Kotikalapudi, Chalence Safranek-Shrader, Andrew Goodman, Joshua Kessinger, Eran Globen, Prateek Kolhar, Chris Gorgolewski, Ali Ibrahim, Yang Song, Ali Eichenbaum, Thomas Brovelli, Sahitya Potluri, Preethi Lahoti, Cip Baetu, Ali Ghorbani, Charles Chen, Andy Crawford, Shalini Pal, Mukund Sridhar, Petru Gurita, Asier Mujika, Igor Petrovski, Pierre-Louis Cedoz, Chenmei Li, Shiyuan Chen, Niccolò Dal Santo, Siddharth Goyal, Jitesh Punjabi, Karthik Kappaganthu, Chester Kwak, Pallavi LV, Sarmishta Velury, Himadri Choudhury, Jamie Hall, Premal Shah, Ricardo Figueira, Matt Thomas, Minjie Lu, Ting Zhou, Chintu Kumar, Thomas Jurdi, Sharat Chikkerur, Yenai Ma, Adams Yu, Soo Kwak, Victor Ähdel, Sujeevan Rajayogam, Travis Choma, Fei Liu, Aditya Barua, Colin Ji, Ji Ho Park, Vincent Hellendoorn, Alex Bailey, Taylan Bilal, Huanjie Zhou, Mehrdad Khatir, Charles Sutton, Wojciech Rzadkowski, Fiona Macintosh, Konstantin Shagin, Paul Medina, Chen Liang, Jinjing Zhou, Pararth Shah, Yingying Bi, Attila Dankovics, Shipra Banga, Sabine Lehmann, Marissa Bredesen, Zifan Lin, John Eric Hoffmann, Jonathan Lai, Raynald Chung, Kai Yang, Nihal Balani, Arthur Bražinskas, Andrei Sozanschi, Matthew Hayes, Héctor Fernández Alcalde, Peter Makarov, Will Chen, Antonio Stella, Liselotte Snijders, Michael Mandl, Ante Kärrman, Paweł Nowak, Xinyi Wu, Alex Dyck, Krishnan Vaidyanathan, Raghavender R, Jessica Mallet, Mitch Rudominer, Eric Johnston, Sushil Mittal, Akhil Udathu, Janara Christensen, Vishal Verma, Zach Irving, Andreas Santucci, Gamaleldin Elsayed, Elnaz Davoodi, Marin Georgiev, Ian Tenney, Nan Hua, Geoffrey Cideron, Edouard Leurent, Mahmoud Alnahlawi, Ionut Georgescu, Nan Wei, Ivy 
Zheng, Dylan Scandinaro, Heinrich Jiang, Jasper Snoek, Mukund Sundararajan, Xuezhi Wang, Zack Ontiveros, Itay Karo, Jeremy Cole, Vinu Rajashekhar, Lara Tumeh, Eyal Ben-David, Rishub Jain, Jonathan Uesato, Romina Datta, Oskar Bunyan, Shimu Wu, John Zhang, Piotr Stanczyk, Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson, Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan Muhammad, Jin Miao, Andrew Lee, Nino Vieillard, Jane Park, Jiageng Zhang, Jeff Stanway, Drew Garmon, Abhijit Karmarkar, Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac, Geoffrey Irving, Edward Loper, Michael Fink, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Ivan Petrychenko, Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Peter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman Casagrande, Evan Palmer, Paul Suganthan, Alfonso Castaño, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybiński, Ashwin Sreevatsa, Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu, Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian LIN, Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Ginger Perng, Elena Allica Abellan, Mingyang Zhang, Ishita Dasgupta, Nate Kushman, Ivo Penchev, Alena Repina, Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa, Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Rama Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Daniel Andor, Pedro Valenzuela, Minnie Lui, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang, Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao Zhang, Achintya Singhal, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Orgad Keller, David Reid, Daniel Finchelstein, Maria Abi 
Raad, Remi Crocker, Peter Hawkins, Robert Dadashi, Colin Gaffney, Ken Franko, Anna Bulanova, Rémi Leblond, Shirley Chung, Harry Askham, Luis C. Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-Cheng Lin, Colin Evans, Alek Dimitriev, Hannah Forbes, Dylan Banarse, Zora Tung, Mark Omernick, Colton Bishop, Rachel Sterneck, Rohan Jain, Jiawei Xia, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem Banzal, Daniel J. Mankowitz, Alex Polozov, Victoria Krakovna, Sasha Brown, MohammadHossein Bateni, Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Matthieu Geist, Ser tan Girgin, Hui Li, Jiayu Ye, Ofir Roval, Reiko Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Kathy Wu, David Miller, Nicolas Sonnerat, Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Tian Huey Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Daniel Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman, John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafle, Tanya Grunina, Rishika Sinha, Alice Talbert, Diane Wu, Denese Owusu-Afriyie, Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, Jing Li, Saaber Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Yeongil Ko, Laura Knight, Amélie Héliou, Ning Niu, Shane Gu, Chenxi Pang, Yeqing Li, Nir Levine, Ariel Stolovich, Rebeca Santamaria-Fernandez, Sonam Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Charlie Deck, Hyo Lee, Zonglin Li, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, Sho Arora, Christy Koh, Soheil Hassas Yeganeh, Siim Põder, Mukarram Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz, Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, 
Wei Fan, Aaron Parisi, Joe Stanton, Vinod Koverkathu, Christopher A. Choquette-Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Mani Varadarajan, Sanaz Bahargam, Rob Willoughby, David Gaddy, Guillaume Desjardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière, Alanna Walton, Clément Crepy, Alicia Parrish, Zongwei Zhou, Clement Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fidjeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker, Alex Morris, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anastasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen, Amir Globerson, Lynette Webb, Sahil Dua, Dong Li, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani, Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mihir Sanjay Kale, Jinliang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun, Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Evgenii Eltyshev, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim, Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason Baldridge, Han Zhang, Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, XiangHai Sheng, Emily Xue, Sherjil Ozair, Christof Angermueller, Xiaowei Li, Anoop Sinha, Weiren Wang, Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Gurumurthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Urbanowicz, Jennimaria Palomaki, Chrisantha Fernando, Ken Durden, Harsh Mehta, Nikola Momchev, Elahe 
Rahimtoroghi, Maria Georgaki, Amit Raul, Sebastian Ruder, Morgan Redshaw, Jinhyuk Lee, Denny Zhou, Komal Jalan, Dinghua Li, Blake Hechtman, Parker Schuh, Milad Nasr, Kieran Milan, Vladimir Mikulik, Juliana Franco, Tim Green, Nam Nguyen, Joe Kelley, Aroma Mahendru, Andrea Hu, Joshua Howland, Ben Vargas, Jeffrey Hui, Kshitij Bansal, Vikram Rao, Rakesh Ghiya, Emma Wang, Ke Ye, Jean Michel Sarr, Melanie Moranski Preston, Madeleine Elish, Steve Li, Aakash Kaku, Jigar Gupta, Ice Pasupat, Da-Cheng Juan, Milan Someswar, Tejvi M., Xinyun Chen, Aida Amini, Alex Fabrikant, Eric Chu, Xuanyi Dong, Amruta Muthal, Senaka Buthpitiya, Sarthak Jauhari, Nan Hua, Urvashi Khandelwal, Ayal Hitron, Jie Ren, Larissa Rinaldi, Shahar Drath, Avigail Dabush, Nan-Jiang Jiang, Harshal Godhia, Uli Sachs, Anthony Chen, Yicheng Fan, Hagai Taitelbaum, Hila Noga, Zhuyun Dai, James Wang, Chen Liang, Jenny Hamer, Chun-Sung Ferng, Chenel Elkind, Aviel Atias, Paulina Lee, Vít Listík, Mathias Carlen, Jan van de Kerkhof, Marcin Pikus, Krunoslav Zaher, Paul Müller, Sasha Zykova, Richard Stefanec, Vitaly Gatsko, Christoph Hirnschall, Ashwin Sethi, Xingyu Federico Xu, Chetan Ahuja, Beth Tsai, Anca Stefanoiu, Bo Feng, Keshav Dhandhania, Manish Katyal, Akshay Gupta, Atharva Parulekar, Divya Pitta, Jing Zhao, Vivaan Bhatia, Yashodha Bhavnani, Omar Alhadlaq, Xiaolin Li, Peter Danenberg, Dennis Tu, Alex Pine, Vera Filippova, Abhipso Ghosh, Ben Limonchik, Bhargava Urala, Chaitanya Krishna Lanka, Derik Clive, Yi Sun, Edward Li, Hao Wu, Kevin Hongtongsak, Ianna Li, Kalind Thakkar, Kuanysh Omarov, Kushal Majmundar, Michael Alverson, Michael Kucharski, Mohak Patel, Mudit Jain, Maksim Zabelin, Paolo Pelagatti, Rohan Kohli, Saurabh Kumar, Joseph Kim, Swetha Sankar, Vineet Shah, Lakshmi Ramachandruni, Xiangkai Zeng, Ben Bariach, Laura Weidinger, Tu Vu, Amar Subramanya, Sissie Hsiao, Demis Hassabis, Koray Kavukcuoglu, Adam Sadovsky, Quoc Le, Trevor Strohman, Yonghui Wu, Slav Petrov, Jeffrey Dean, and Oriol Vinyals. 2024. 
Gemini: A family of highly capable multimodal models. Preprint, arXiv:2312.11805.
Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models. Preprint, arXiv:2201.11903.
Yin et al. (2018) Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In International Conference on Mining Software Repositories, MSR, pages 476–486. ACM.

Appendix A

A.1 Prompts

This section lists the prompts used for each model. We try several prompts per model and report results for the best-performing one. In general, the model input consists of a system prompt followed by a model-specific prompt template.

A.1.1 Common prompt

For zero-shot inference, we use a common prompt as is for all models, irrespective of each model's prompt format. We observe the best Task-1 results with this prompt. The prompt is as follows.

Listing 3: common prompt


Write an {input_representation} sample with field values as per the {output_representation} format schema given below.

{schema}

{output_representation} sample:

‘‘‘
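As a minimal sketch, the common prompt can be filled with ordinary string formatting. The template text is copied as printed in Listing 3; the helper name `build_common_prompt` and the toy schema are illustrative assumptions and are not taken from the paper.

```python
# Sketch only: helper name and toy schema are illustrative assumptions.
FENCE = "`" * 3  # three backticks; the template ends by opening a code fence

# Template text copied as printed in Listing 3.
COMMON_PROMPT = (
    "Write an {input_representation} sample with field values as per the "
    "{output_representation} format schema given below.\n\n"
    "{schema}\n\n"
    "{output_representation} sample:\n\n"
) + FENCE


def build_common_prompt(schema: str,
                        input_representation: str,
                        output_representation: str) -> str:
    """Substitute the three template variables into the common prompt."""
    # str.format only parses the template, so braces inside `schema`
    # (for example a JSON schema) are inserted unchanged.
    return COMMON_PROMPT.format(
        schema=schema,
        input_representation=input_representation,
        output_representation=output_representation,
    )


toy_schema = '{"type": "object", "required": ["name"]}'  # hypothetical schema
prompt = build_common_prompt(toy_schema, "JSON", "YAML")
print(prompt)
```

Ending the prompt on an opening fence nudges the model to continue directly with the sample, so the generated text up to the next closing fence can be taken as the answer.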

A.1.2 Granite model family

The Granite models generally follow a question-answering format. The Task-1 prompts for the Granite family are as follows.

System prompt:
System:
You are an intelligent AI programming assistant, utilizing a Granite code language model developed by IBM. Your primary function is to assist users in code explanation, code generation and other software engineering tasks. You MUST follow these guidelines: - Your responses must be factual. Do not assume the answer is yes when you do not know, and DO NOT SHARE FALSE INFORMATION. - You should give concise answers. You should follow the instruction and provide the answer in the specified format and DO NOT SHARE FALSE INFORMATION.

Prompt 2:

Listing 4: QA-prompt-1


{System prompt}

Question:

Write an {input_representation} sample with field values as per the {input_representation} format schema given below.

{schema}

Answer:

‘‘‘

Prompt 3:

Listing 5: QA-prompt-2


{System prompt}

Question:

Write an {input_representation} sample with field values as per the {output_representation} format schema given below. Please wrap your code answer using ```

{schema}

Answer:

‘‘‘

{input_representation} and {output_representation} are template variables. {input_representation} takes the values JSON, YAML, XML, Python, and natural language; {output_representation} takes the values JSON, YAML, and XML.
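The variable ranges above imply at most 5 × 3 = 15 representation pairs. The sketch below enumerates them and fills the QA-prompt-2 template (Listing 5) for each pair. It assumes every pairing is used, which the text does not state; `iter_qa_prompts` and the `schemas` mapping are hypothetical names introduced for illustration.

```python
from itertools import product

# Value ranges as stated in the text.
INPUT_REPRESENTATIONS = ["JSON", "YAML", "XML", "Python", "natural language"]
OUTPUT_REPRESENTATIONS = ["JSON", "YAML", "XML"]

FENCE = "`" * 3  # three backticks

# Template text copied as printed in Listing 5 (QA-prompt-2).
QA_PROMPT_2 = (
    "{system_prompt}\n\n"
    "Question:\n\n"
    "Write an {input_representation} sample with field values as per the "
    "{output_representation} format schema given below. "
    "Please wrap your code answer using " + FENCE + "\n\n"
    "{schema}\n\n"
    "Answer:\n\n" + FENCE
)


def iter_qa_prompts(system_prompt, schemas):
    """Yield ((input_rep, output_rep), prompt) for every representation pair.

    `schemas` is a hypothetical dict mapping each input representation
    to the schema text written in that representation.
    """
    for inp, out in product(INPUT_REPRESENTATIONS, OUTPUT_REPRESENTATIONS):
        yield (inp, out), QA_PROMPT_2.format(
            system_prompt=system_prompt,
            input_representation=inp,
            output_representation=out,
            schema=schemas[inp],
        )


# Placeholder schemas, one per input representation (illustrative only).
schemas = {rep: f"<schema written in {rep}>" for rep in INPUT_REPRESENTATIONS}
prompts = dict(iter_qa_prompts("System:\n<system prompt text>", schemas))
print(len(prompts))  # 15
```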

A.1.3 Llama family

For the codellama 34B model, we wrap the common prompt in [INST] and [/INST] tags. For the llama3-8B model, we use the system prompt below along with user tags (see https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).

System prompt: You are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive. If a question does not make any sense or is not factually coherent, explain why instead of answering something not correct. If you don’t know the answer to a question, please don’t share false information.

In addition, as with the Granite family, we try the question-answering format and an instruction to wrap the output in triple backticks (```).
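The two wrapping schemes described in this subsection can be sketched as follows. The [INST] wrapping follows the text directly. The Llama 3 layout is an assumption based on the chat format documented on the linked model card; the exact whitespace is not given in the paper, and in practice the tokenizer's own chat template is the safer source. Both function names are hypothetical.

```python
def wrap_codellama(prompt: str) -> str:
    """Wrap a prompt in [INST] ... [/INST] tags, as described for codellama 34B.

    The single spaces around the prompt are an assumption.
    """
    return f"[INST] {prompt} [/INST]"


def wrap_llama3(system_prompt: str, user_prompt: str) -> str:
    """Assemble a system + user turn in the Llama 3 instruct chat layout.

    Assumed layout (from the public model card, not from the paper):
    header tokens around each role name, <|eot_id|> after each message,
    and an open assistant header so the model continues with its answer.
    """
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


common_prompt = "<common prompt text>"  # placeholder for the filled Listing 3 prompt
print(wrap_codellama(common_prompt))
print(wrap_llama3("<system prompt text>", common_prompt))
```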