Search | arXiv e-print repository

On Mitigating Code LLM Hallucinations with API Documentation

Authors: Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, Varun Kumar

Abstract: In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: for example, GPT-4o achieves only 38.58% valid low frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly improves performance for low frequency APIs (increase to 47.94% with DAG) but negatively impacts high frequency APIs when using sub-optimal retrievers (a 39.02% absolute drop). To mitigate this, we propose to intelligently trigger DAG where we check against an API index or leverage Code LLMs' confidence scores to retrieve only when needed. We demonstrate that our proposed methods enhance the balance between low and high frequency API performance, resulting in more reliable API invocations (8.20% absolute improvement on CloudAPIBench for GPT-4o).

Submitted 12 July, 2024; originally announced July 2024.
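
As a rough illustration of the selective triggering idea in the abstract above, the Python sketch below decides whether to retrieve documentation for a candidate API call, either by checking an API index or by thresholding a confidence score. The function names, the example index contents, and the 0.8 threshold are hypothetical and are not taken from the paper; the paper's actual triggering logic may differ.

```python
from typing import Set


def trigger_by_index(api_name: str, api_index: Set[str]) -> bool:
    """Retrieve documentation when the generated API name is not in a known index."""
    return api_name not in api_index


def trigger_by_confidence(confidence: float, threshold: float = 0.8) -> bool:
    """Retrieve documentation when the model's confidence falls below a threshold."""
    return confidence < threshold


# Hypothetical index of valid API names (illustrative only).
KNOWN_APIS = {"s3.put_object", "s3.get_object"}

print(trigger_by_index("s3.put_object", KNOWN_APIS))   # known API: no retrieval
print(trigger_by_index("s3.upload_blob", KNOWN_APIS))  # unknown API: retrieve
print(trigger_by_confidence(0.95))                     # confident: no retrieval
print(trigger_by_confidence(0.40))                     # unsure: retrieve
```

Either check skips retrieval in the cases where the model is likely to be right already, which is the situation the abstract associates with high frequency APIs.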

arXiv:2407.03956 [pdf, other]

Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems

Authors: Shmuel Berman, Kathleen McKeown, Baishakhi Ray

Abstract: Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. These frameworks are still usually insufficient to solve complicated logical problems, such as Zebra puzzles, due to the inherent complexity of translating natural language clues into logical statements. We introduce a multi-agent system, ZPS, that integrates LLMs with an off-the-shelf theorem prover. This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with a theorem prover, and using feedback between the agents to repeatedly improve their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions and show that the automated grader is reliable by evaluating it in a user study. Our approach shows improvement in all three LLMs we tested, with GPT-4 showing 166% improvement in the number of fully correct solutions.

Submitted 9 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

MSC Class: 68T01; 68T20; 68T27; ACM Class: I.2.3; I.2.6; I.2.7; I.2.11

arXiv:2407.02680 [pdf, other]

KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution

Authors: Alex Mathai, Chenxi Huang, Petros Maniatis, Aleksandr Nogikh, Franjo Ivancic, Junfeng Yang, Baishakhi Ray

Abstract: Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. Unlike application-level software, a systems codebase like Linux is multilingual (low-level C/Assembly/Bash/Rust); gigantic (>20 million lines); critical (impacting billions of devices worldwide), and highly concurrent (involving complex multi-threading). To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym (a platform) and kBench (a dataset). The kGym platform provides a SE environment for large-scale experiments on the Linux kernel, including compiling and running kernels in parallel across several virtual machines, detecting operations and crashes, inspecting logs, and querying and patching the code base. We use kGym to facilitate evaluation on kBench, a crash resolution benchmark drawn from real-world Linux kernel bugs. An example bug in kBench contains crashing stack traces, a bug-reproducer file, a developer-written fix, and other associated data. To understand current performance, we conduct baseline experiments by prompting LLMs to resolve Linux kernel crashes. Our initial evaluations reveal that the best performing LLM achieves 0.72% and 5.38% in the unassisted and assisted (i.e., buggy files disclosed to the model) settings, respectively. These results highlight the need for further research to enhance model performance in SE tasks.
Improving performance on kBench requires models to master new learning skills, including understanding the cause of crashes and repairing faults, writing memory-safe and hardware-aware code, and understanding concurrency. As a result, this work opens up multiple avenues of research at the intersection of machine learning and systems software.

Submitted 8 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

arXiv:2406.19122 [pdf, other]

Mitigation of fine hydrophobic liquid aerosols by polydispersed uncharged and charged water droplets

Authors: Debabrat Biswal, Bahni Ray, Debabrata Dasgupta, Rochish M. Thaokar, Y. S. Mayya

Abstract: One of the harmful contaminants in the atmosphere, which negatively affects the well-being of both humans and animals, is the suspended respirable particles. The most difficult aspect of the study is now removing these fine respirable particles from the atmosphere. This study investigates the scavenging phenomenon of fine hydrophobic liquid aerosols (10 nm to 1050 nm) by uncharged and charged droplets in a self-made scaled test rig. In this study, a hollow cone nozzle with a 1 mm orifice diameter uses tap water to disperse liquid into fine droplets. The paraffin oil and Di-Ethyl-Hexyl-Sebacat (DEHS) solution are aerosolized to be scavenged by water droplets. This research employs a high-speed imaging technique and theoretical modeling approach to measure the size distribution and charge acquired by water droplets respectively. The findings of this study show that uncharged droplets dispersed

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.10994 [pdf, other]

Charged drop impinging on particles dispersed over a metallic plate: A method of particle cleaning

Authors: D. Biswal, S. K. Saroj, B. Ray, Debabrata Dasgupta, R. M. Thaokar, Y. S. Mayya

Abstract: An electric field applied to a droplet impinging on a hydrophobic surface has an extensive variety of applications, including anti-icing, heat transfer enhancement, self-cleaning, droplet manipulation, and electrostatic spraying. The present study demonstrates an effective method of particle removal using a charged droplet. This method employs a pin-plate electrode setup to investigate the dynamics of a charged droplet impact on the surface covered with particles. Particles with different properties, such as wettability and electrical conductivity, have been used. Silane-coated glass beads, carbon black, and glass beads are dispersed over the ground copper electrode. The applied potential is also varied from 2 kV to 4 kV. High-speed imaging is employed to visualize the drop motion, dynamic behavior, and self-cleaning phenomenon. The experimental results indicate that drop generation and impact occur at applied potentials of 2.5, 3, and 3.5 kV; in contrast, at 2 kV, there is no droplet pinch-off. At 4 kV, electric breakdown and bridging of the droplet between the capillary and ground electrode are observed. The drop impact on the silane-coated glass bead leads to their attachment due to the adhesiveness of the particles and the droplet. The silane-coated particles are removed from the droplet surface due to the deformation of the drop and the electric repulsive force. In the case of carbon black and glass beads, the particles are captured by the droplet due to the electrostatic force of attraction.
Higher electric potentials lead to an increased spreading diameter of the droplet. The higher electric field enhances the contact area between the droplet and the particles, thereby removing more particles.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.06461 [pdf, other]

Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies

Authors: Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, Ben Athiwaratkun

Abstract: A diverse array of reasoning strategies has been proposed to elicit the capabilities of large language models. However, in this paper, we point out that traditional evaluations which focus solely on performance metrics miss a key factor: the increased effectiveness due to additional compute. By overlooking this aspect, a skewed view of strategy efficiency is often presented. This paper introduces a framework that incorporates the compute budget into the evaluation, providing a more informative comparison that takes into account both performance metrics and computational cost. In this budget-aware perspective, we find that complex reasoning strategies often don't surpass simpler baselines purely due to algorithmic ingenuity, but rather due to the larger computational resources allocated. When we provide a simple baseline like chain-of-thought self-consistency with comparable compute resources, it frequently outperforms reasoning strategies proposed in the literature. In this scale-aware perspective, we find that unlike self-consistency, certain strategies such as multi-agent debate or Reflexion can become worse if more compute budget is utilized.

Submitted 14 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.
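
To make the budget-matched baseline in the abstract above more concrete, here is a minimal Python sketch of self-consistency (majority voting over sampled answers) capped by a token budget. The helper names, the per-sample token costs, and the budget values are assumptions for illustration; the paper's evaluation framework is not reproduced here.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency_under_budget(samples: List[Tuple[str, int]], token_budget: int) -> str:
    """Vote over as many (answer, token_cost) samples as fit in the budget, in order."""
    used = 0
    kept: List[str] = []
    for answer, cost in samples:
        if used + cost > token_budget:
            break  # the next sample would exceed the budget
        used += cost
        kept.append(answer)
    if not kept:
        raise ValueError("token budget too small for even one sample")
    return majority_vote(kept)


# Hypothetical sampled answers with made-up token costs.
samples = [("42", 100), ("41", 120), ("42", 90), ("41", 110), ("41", 100)]
small_budget_answer = self_consistency_under_budget(samples, token_budget=320)
large_budget_answer = self_consistency_under_budget(samples, token_budget=1000)
print(small_budget_answer, large_budget_answer)
```

Because the voted answer can change as more samples fit into the budget, comparing strategies only makes sense when each is given the same budget.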

arXiv:2406.06435 [pdf, other]

Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

Authors: Brian Hu, Bill Ray, Alice Leung, Amy Summerville, David Joy, Christopher Funk, Arslan Basharat

Abstract: In difficult decision-making scenarios, it is common to have conflicting opinions among expert human decision-makers as there may not be a single right answer. Such decisions may be guided by different attributes that can be used to characterize an individual's decision. We introduce a novel dataset for medical triage decision-making, labeled with a set of decision-maker attributes (DMAs). This dataset consists of 62 scenarios, covering six different DMAs, including ethical principles such as fairness and moral desert. We present a novel software framework for human-aligned decision-making by utilizing these DMAs, paving the way for trustworthy AI with better guardrails. Specifically, we demonstrate how large language models (LLMs) can serve as ethical decision-makers, and how their decisions can be aligned to different DMAs using zero-shot prompting. Our experiments focus on different open-source models with varying sizes and training techniques, such as Falcon, Mistral, and Llama 2. Finally, we also introduce a new form of weighted self-consistency that improves the overall quantified performance. Our results provide new research directions in the use of LLMs as alignable decision-makers. The dataset and open-source software are publicly available at: https://github.com/ITM-Kitware/llm-alignable-dm.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 15 pages total (including appendix), NAACL 2024 Industry Track

arXiv:2406.01006 [pdf, other]

SemCoder: Training Code Language Models with Comprehensive Semantics

Authors: Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray

Abstract: Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data and the need for thorough semantic understanding for complex tasks like debugging and program repair. We introduce a novel strategy to train Code LLMs with comprehensive semantics, encompassing high-level functional descriptions, local execution effects of individual statements, and overall input/output behavior, thereby linking static code text with dynamic execution states. We begin by collecting PyX, a clean code corpus of fully executable samples with functional descriptions and execution tracing. We propose training Code LLMs to write code and represent and reason about execution behaviors using natural language, mimicking human verbal debugging. This approach led to the development of SemCoder, a Code LLM with only 6.7B parameters, which shows competitive performance with GPT-3.5-turbo on code generation and execution reasoning tasks. SemCoder achieves 81.1% on HumanEval (GPT-3.5-turbo: 76.8%) and 54.5% on CRUXEval-I (GPT-3.5-turbo: 50.3%). We also study the effectiveness of SemCoder's monologue-style execution reasoning compared to concrete scratchpad reasoning, showing that our approach integrates semantics from multiple dimensions more smoothly. Finally, we demonstrate the potential of applying learned semantics to improve Code LLMs' debugging and self-refining capabilities.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.18649 [pdf, other]

Training LLMs to Better Self-Debug and Explain Code

Authors: Nan Jiang, Xiaopeng Li, Shiqi Wang, Qiang Zhou, Soneya Binta Hossain, Baishakhi Ray, Varun Kumar, Xiaofei Ma, Anoop Deoras

Abstract: In the domain of code generation, self-debugging is crucial. It allows LLMs to refine their generated code based on execution feedback. This is particularly important because generating correct solutions in one attempt proves challenging for complex tasks. Prior works on self-debugging mostly focus on prompting methods by providing LLMs with few-shot examples, which work poorly on small open-sourced LLMs. In this work, we propose a training framework that significantly improves self-debugging capability of LLMs. Intuitively, we observe that a chain of explanations on the wrong code followed by code refinement helps LLMs better analyze the wrong code and do refinement. We thus propose an automated pipeline to collect a high-quality dataset for code explanation and refinement by generating a number of explanations and refinement trajectories and filtering via execution verification. We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories with a novel reward design considering code explanation and refinement quality. SFT improves the pass@1 by up to 15.92% and pass@10 by 9.30% over four benchmarks. RL training brings additional up to 3.54% improvement on pass@1 and 2.55% improvement on pass@10. The trained LLMs show iterative refinement ability, and can keep refining code continuously. Lastly, our human evaluation shows that the LLMs trained with our framework generate more useful code explanations and help developers better understand bugs in source code.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.18574 [pdf, other]

SpecTra: Enhancing the Code Translation Ability of Language Models by Generating Multi-Modal Specifications

Authors: Vikram Nitin, Rahul Krishna, Baishakhi Ray

Abstract: Large language models (LLMs) are increasingly being used for the task of automated code translation, which has important real-world applications. However, most existing approaches use only the source code of a program as an input to an LLM, and do not consider the different kinds of specifications that can be extracted from a program. In this paper, we propose SpecTra, a multi-stage approach that uses a novel self-consistency filter to first generate high-quality static specifications, test cases, and natural language descriptions from a given program, and then uses these along with the source code to improve the quality of LLM-generated translations. We evaluate SpecTra on three code translation tasks - C to Rust, C to Go, and JavaScript to TypeScript - and show that it can enhance the performance of six popular LLMs on these tasks by up to 10 percentage points and a relative improvement of 26%. Our research suggests that generating high-quality specifications could be a promising and efficient way to improve the performance of LLMs for code translation. We make our code and data available, anonymized for review.

Submitted 10 July, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.15805 [pdf, other]

DSAM: A Deep Learning Framework for Analyzing Temporal and Spatial Dynamics in Brain Networks

Authors: Bishal Thapaliya, Robyn Miller, Jiayu Chen, Yu-Ping Wang, Esra Akbas, Ram Sapkota, Bhaskar Ray, Pranav Suresh, Santosh Ghimire, Vince Calhoun, Jingyu Liu

Abstract: Resting-state functional magnetic resonance imaging (rs-fMRI) is a noninvasive technique pivotal for understanding human neural mechanisms of intricate cognitive processes. Most rs-fMRI studies compute a single static functional connectivity matrix across brain regions of interest, or dynamic functional connectivity matrices with a sliding window approach. These approaches are at risk of oversimplifying brain dynamics and lack proper consideration of the goal at hand. While deep learning has gained substantial popularity for modeling complex relational data, its application to uncovering the spatiotemporal dynamics of the brain is still limited. We propose a novel interpretable deep learning framework that learns a goal-specific functional connectivity matrix directly from time series and employs a specialized graph neural network for the final classification. Our model, DSAM, leverages temporal causal convolutional networks to capture the temporal dynamics in both low- and high-level feature representations, a temporal attention unit to identify important time points, a self-attention unit to construct the goal-specific connectivity matrix, and a novel variant of graph neural network to capture the spatial dynamics for downstream classification. To validate our approach, we conducted experiments on the Human Connectome Project dataset with 1075 samples to build and interpret the model for the classification of sex group, and the Adolescent Brain Cognitive Development Dataset with 8520 samples for independent testing.
In comparisons of our proposed framework with other state-of-the-art models, the results suggest that this novel approach goes beyond the assumption of a fixed connectivity matrix and provides evidence of goal-specific brain connectivity patterns, which opens up the potential to gain deeper insights into how the human brain adapts its functional connectivity specific to the task at hand.

Submitted 19 May, 2024; originally announced May 2024.

Comments: 18 Pages, 4 figures

arXiv:2405.12901 [pdf, other]

Diffusion of brightened dark excitons in a high-angle incommensurate Moiré homobilayer

Authors: Arnab Barman Ray, Trevor Ollis, Sethuraj K. R., Anthony Nickolas Vamivakas

Abstract: The last few years have witnessed a surge in interest and research efforts in the field of twistronics, especially in low-angle twisted bilayers of transition metal dichalcogenides. These novel material platforms have been demonstrated to host periodic arrays of excitonic quantum emitters, interlayer excitons with long lifetimes, and exotic many-body states. While much remains to be known and understood about these heterostructures, the field of high-angle, incommensurate bilayers is even less explored. At twist angles larger than a few degrees, the presence of periodicity in these bilayers becomes chaotic, making the systems essentially aperiodic and incommensurate in nature due to the limitations of fabrication techniques. In this work, we demonstrate the emergence of a brightened dark intralayer exciton in a twisted molybdenum diselenide homobilayer. We show that this dark exciton diffuses across the excitation spot more efficiently as compared to trions or excitons, reaching diffusion lengths greater than 4 microns. Temperature-dependent spectra provide corroborative evidence and reveal a brightened dark trion. Our results reveal some of the richness of the physics of these high-angle systems.

Submitted 12 July, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2405.02213 [pdf, other]

Automatic Programming: Large Language Models and Beyond

Authors: Michael R. Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, Patanamon Thongtanunam

Abstract: Automatic programming has seen increasing popularity due to the emergence of tools like GitHub Copilot which rely on Large Language Models (LLMs). At the same time, automatically generated code faces challenges during deployment due to concerns around quality and trust. In this article, we study automated coding in a general sense and study the concerns around code quality, security and related issues of programmer responsibility. These are key issues for organizations while deciding on the usage of automatically generated code. We discuss how advances in software engineering such as program repair and analysis can enable automatic programming. We conclude with a forward-looking view, focusing on the programming environment of the near future, where programmers may need to switch to different roles to fully utilize the power of automatic programming. Automated repair of automatically generated programs from LLMs can help produce higher assurance code from LLMs, along with evidence of assurance.

Submitted 15 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

arXiv:2405.01567 [pdf, other]

CodeFort: Robust Training for Code Generation Models

Authors: Yuhao Zhang, Shiqi Wang, Haifeng Qian, Zijian Wang, Mingyue Shang, Linbo Liu, Sanjay Krishna Gouda, Baishakhi Ray, Murali Krishna Ramanathan, Xiaofei Ma, Anoop Deoras

Abstract: Code generation models are not robust to small perturbations, which often lead to inconsistent and incorrect generations and significantly degrade the performance of these models. Improving the robustness of code generation models is crucial to better user experience when these models are deployed in real-world applications. However, existing efforts have not addressed this issue for code generation models. To fill this gap, we propose CodeFort, a framework to improve the robustness of code generation models, generalizing a large variety of code perturbations to enrich the training data and enabling various robust training strategies, mixing data augmentation, batch augmentation, adversarial logits pairing, and contrastive learning, all carefully designed to support high-throughput training. Extensive evaluations show that we improve the average robust pass rates of baseline CodeGen models from 14.79 to 21.74. Notably, the improvement in robustness against code-syntax perturbations is evidenced by a significant decrease in pass rate drop from 95.04% to 53.35%.

Submitted 11 April, 2024; originally announced May 2024.

arXiv:2403.18746 [pdf, other]

CYCLE: Learning to Self-Refine the Code Generation

Authors: Yangruibo Ding, Marcus J. Min, Gail Kaiser, Baishakhi Ray

Abstract: Pre-trained code language models have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their self-refinement capability is typically overlooked by the existing evaluations of code LMs, which focus only on the accuracy of the one-time prediction. For the cases when code LMs fail to implement the correct program, developers actually find it hard to debug and fix the faulty prediction since it is not written by the developers themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose the CYCLE framework, learning to self-refine the faulty generation according to the available feedback, such as the execution results reported by the test suites. We evaluate CYCLE on three popular code generation benchmarks, HumanEval, MBPP, and APPS. The results reveal that CYCLE successfully maintains, sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs. We implement four variants of CYCLE with varied numbers of parameters across 350M, 1B, 2B, and 3B, and the experiments show that CYCLE consistently boosts the code generation performance, by up to 63.5%, across benchmarks and varied model sizes. We also notice that CYCLE outperforms code LMs that have 3× more parameters in self-refinement.

Submitted 27 March, 2024; originally announced March 2024.

Comments: Camera-ready for OOPSLA'24

arXiv:2403.18624 [pdf, other]

Vulnerability Detection with Code Language Models: How Far Are We?

Authors: Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen

Abstract: In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul.
Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.

Submitted 10 July, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted for the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025); Camera-ready Work in Progress

arXiv:2403.16921 [pdf, other]

PropTest: Automatic Property Testing for Improved Visual Programming

Authors: Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez

Abstract: Visual Programming has recently emerged as an alternative to end-to-end black-box visual reasoning models. This type of method leverages Large Language Models (LLMs) to generate the source code for an executable computer program that solves a given problem. This strategy has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose PropTest, a general strategy that improves visual programming by further using an LLM to generate code that tests for visual properties in an initial round of proposed solutions. Our method generates tests for data-type consistency, output syntax, and semantic properties. PropTest achieves comparable results to state-of-the-art methods while using publicly available LLMs. This is demonstrated across different benchmarks on visual question answering and referring expression comprehension. Particularly, PropTest improves ViperGPT by obtaining 46.1% accuracy (+6.0%) on GQA using Llama3-8B and 59.5% (+8.1%) on RefCOCO+ using CodeLlama-34B.

Submitted 22 July, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

Comments: Project Page: https://jaywonkoo17.github.io/PropTest/

arXiv:2402.00097 [pdf, other]

Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM

Authors: Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, Baishakhi Ray

Abstract: Testing plays a pivotal role in ensuring software quality, yet conventional Search Based Software Testing (SBST) methods often struggle with complex software units, achieving suboptimal test coverage. Recent works using large language models (LLMs) for test generation have focused on improving generation quality through optimizing the test generation context and correcting errors in model outputs, but use fixed prompting strategies that prompt the model to generate tests without additional guidance. As a result LLM-generated testsuites still suffer from low coverage. In this paper, we present SymPrompt, a code-aware prompting strategy for LLMs in test generation. SymPrompt's approach is based on recent work that demonstrates LLMs can solve more complex logical problems when prompted to reason about the problem in a multi-step fashion. We apply this methodology to test generation by deconstructing the testsuite generation process into a multi-stage sequence, each of which is driven by a specific prompt aligned with the execution paths of the method under test, and exposing relevant type and dependency focal context to the model. Our approach enables pretrained LLMs to generate more complete test cases without any additional training. We implement SymPrompt using the TreeSitter parsing framework and evaluate on a benchmark challenging methods from open source Python projects. SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2. Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.

Submitted 2 April, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.02845 [pdf, other]

Protoplanetary disk size under non-ideal magnetohydrodynamics: A general formalism with inclined magnetic field

Authors: Yueh-Ning Lee, Barshan Ray, Pierre Marchand, Patrick Hennebelle

Abstract: Many mechanisms have been proposed to alleviate the magnetic catastrophe, which prevents the Keplerian disk from forming inside a collapsing magnetized core. Such propositions include inclined field and non-ideal magnetohydrodynamics effects, and have been supported with numerical experiments. Models have been formulated for typical disk sizes when a field threads the rotating disk, parallel to the rotation axis, while observations at the core scales do not seem to show evident correlation between the directions of angular momentum and the magnetic field. In the present study, we propose a new model that considers both vertical and horizontal fields and discuss their effects on the protoplanetary disk size.

Submitted 5 January, 2024; originally announced January 2024.

Comments: Accepted for publication in ApJ Letters

arXiv:2311.03520 [pdf, other]

Brain Networks and Intelligence: A Graph Neural Network Based Approach to Resting State fMRI Data

Authors: Bishal Thapaliya, Esra Akbas, Jiayu Chen, Raam Sapkota, Bhaskar Ray, Pranav Suresh, Vince Calhoun, Jingyu Liu

Abstract: Resting-state functional magnetic resonance imaging (rsfMRI) is a powerful tool for investigating the relationship between brain function and cognitive processes as it allows for the functional organization of the brain to be captured without relying on a specific task or stimuli. In this paper, we present a novel modeling architecture called BrainRGIN for predicting intelligence (fluid, crystallized, and total intelligence) using graph neural networks on rsfMRI derived static functional network connectivity matrices. Extending from the existing graph convolution networks, our approach incorporates a clustering-based embedding and graph isomorphism network in the graph convolutional layer to reflect the nature of the brain sub-network organization and efficient network expression, in combination with TopK pooling and attention-based readout functions. We evaluated our proposed architecture on a large dataset, specifically the Adolescent Brain Cognitive Development Dataset, and demonstrated its effectiveness in predicting individual differences in intelligence. Our model achieved lower mean squared errors and higher correlation scores than existing relevant graph architectures and other traditional machine learning models for all of the intelligence prediction tasks. The middle frontal gyrus exhibited a significant contribution to both fluid and crystallized intelligence, suggesting their pivotal role in these cognitive processes. Total composite scores identified a diverse set of brain regions to be relevant which underscores the complex nature of total intelligence.

Submitted 26 March, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

arXiv:2310.14053 [pdf, other]

Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Authors: Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray

Abstract: Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.

Submitted 26 February, 2024; v1 submitted 21 October, 2023; originally announced October 2023.

Comments: ICLR 2024

MSC Class: 68

ACM Class: I.2; D.2

arXiv:2310.08507 [pdf, other]

Yuga: Automatically Detecting Lifetime Annotation Bugs in the Rust Language

Authors: Vikram Nitin, Anne Mulhern, Sanjay Arora, Baishakhi Ray

Abstract: The Rust programming language is becoming increasingly popular among systems programmers due to its efficient performance and robust memory safety guarantees. Rust employs an ownership model to ensure this guarantee by allowing each value to be owned by only one identifier at a time. Additionally, it introduces the concept of borrowing and lifetimes to enable other variables to borrow the values under certain conditions temporarily. Despite its benefits, security vulnerabilities have been reported in Rust projects, often attributed to the use of "unsafe" Rust code. These vulnerabilities, in part, arise from incorrect lifetime annotations on function signatures. However, existing tools fail to detect these bugs, primarily because such bugs are rare, challenging to detect through dynamic analysis, and require explicit memory models. To overcome these limitations, first, we characterize incorrect lifetime annotations as a source of memory safety bugs and leverage this understanding to devise a novel static analysis tool, Yuga, to detect potential lifetime annotation bugs. Yuga uses a multi-phase analysis approach, starting with a quick pattern-matching algorithm to identify potential buggy components and then conducting a flow and field-sensitive alias analysis to confirm the bugs. We also curate new datasets of lifetime annotation bugs. Yuga successfully detects bugs with good precision on these datasets, and we make the code and datasets publicly available for review.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2310.07958 [pdf, other]

doi 10.1145/3597503.3639170

Towards Causal Deep Learning for Vulnerability Detection

Authors: Md Mahbubur Rahman, Ira Ceka, Chengzhi Mao, Saikat Chakraborty, Baishakhi Ray, Wei Le

Abstract: Deep learning vulnerability detection has shown promising results in recent years. However, an important challenge that still blocks it from being very useful in practice is that the model is not robust under perturbation and it cannot generalize well over the out-of-distribution (OOD) data, e.g., applying a trained model to unseen projects in the real world. We hypothesize that this is because the model learned non-robust features, e.g., variable names, that have spurious correlations with labels. When the perturbed and OOD datasets no longer have the same spurious features, the model prediction fails. To address the challenge, in this paper, we introduced causality into deep learning vulnerability detection. Our approach CausalVul consists of two phases. First, we designed novel perturbations to discover spurious features that the model may use to make predictions. Second, we applied the causal learning algorithms, specifically, do-calculus, on top of existing deep learning models to systematically remove the use of spurious features and thus promote causal-based prediction. Our results show that CausalVul consistently improved the model accuracy, robustness and OOD performance for all the state-of-the-art models and datasets we experimented with. To the best of our knowledge, this is the first work that introduces do-calculus-based causal learning to software engineering models and shows it's indeed useful for improving the model accuracy, robustness and generalization. Our replication package is located at https://figshare.com/s/0ffda320dcb96c249ef2.

Submitted 14 January, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: ICSE 2024, Camera Ready Version

arXiv:2308.02783 [pdf, other]

An investigation on the impact of two vertically aligned drops on a liquid surface

Authors: Akash Paul, Bahni Ray, Kirti Chandra Sahu, Gautam Biswas

Abstract: The dynamics of two vertically coalescing drops and a pool of the same liquid have been investigated using a Coupled Level Set and Volume of Fluid (CLSVOF) method. Such a configuration enables us to study the dynamic interaction of an arbitrary-shaped liquid conglomerate, formed owing to drop-drop coalescence, with a pool. Similar to drop-pool and drop-drop interactions, partial coalescence is observed when a conglomerate interacts with a pool. The presence of the pool below the father drop is found to influence the coalescence characteristic of the two drops. At the same time, the movement of the capillary waves resulting from the interaction of two drops governs the coalescence dynamics of the conglomerate with the pool. As liquid interfaces interact and generate capillary waves at multiple locations, complex trajectories of capillary waves are observed, which play a crucial role in determining the pinch-off characteristics of the satellite during conglomerate-pool interaction. We examine the effect of the ratio of the diameters of the lower/father drop to the upper/mother drop (D_r) on the coalescence dynamics while maintaining the size of the mother drop constant. The variation in the coalescence dynamics due to change in D_r is quantified in terms of the residence time (tau_r), pinch-off time (tau_p) and the satellite diameter to conglomerate diameter ratio (Ds/Dc). The coalescence dynamics of the conglomerate is then compared with that of an equivalent spherical drop of the same volume and also with that of a drop initialized with the same shape as that of the conglomerate. Finally, the regions of complete and partial coalescence for the conglomerate-pool interactions are demarcated on the Weber number - diameter ratio (We-Dr) space.

Submitted 5 August, 2023; originally announced August 2023.

Comments: 36 pages, 14 figures, Accepted in International Journal of Multiphase Flow

arXiv:2306.07888 [pdf, other]

CAMEO: A Causal Transfer Learning Approach for Performance Optimization of Configurable Computer Systems

Authors: Md Shahriar Iqbal, Ziyuan Zhong, Iftakhar Ahmad, Baishakhi Ray, Pooyan Jamshidi

Abstract: Modern computer systems are highly configurable, with hundreds of configuration options that interact, resulting in an enormous configuration space. As a result, optimizing performance goals (e.g., latency) in such systems is challenging due to frequent uncertainties in their environments (e.g., workload fluctuations). Recently, transfer learning has been applied to address this problem by reusing knowledge from configuration measurements from the source environments, where it is cheaper to intervene than the target environment, where any intervention is costly or impossible. Recent empirical research showed that statistical models can perform poorly when the deployment environment changes because the behavior of certain variables in the models can change dramatically from source to target. To address this issue, we propose CAMEO, a method that identifies invariant causal predictors under environmental changes, allowing the optimization process to operate in a reduced search space, leading to faster optimization of system performance. We demonstrate significant performance improvements over state-of-the-art optimization methods in MLperf deep learning systems, a video analytics pipeline, and a database system.

Submitted 3 October, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

arXiv:2306.07487 [pdf, other]

TRACED: Execution-aware Pre-training for Source Code

Authors: Yangruibo Ding, Ben Steenhoek, Kexin Pei, Gail Kaiser, Wei Le, Baishakhi Ray

Abstract: Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax tree, dependency graphs, etc.). However, program semantics will not be fully exposed before the real execution. Without an understanding of the program execution, statically pre-trained models fail to comprehensively capture the dynamic code properties, such as the branch coverage and the runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities. To close the gap between the static nature of language models and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code language models with a combination of source code, executable inputs, and corresponding execution traces. Our goal is to teach code models the complicated execution logic during the pre-training, enabling the model to statically estimate the dynamic code properties without repeatedly executing code during task-specific fine-tuning. To illustrate the effectiveness of our proposed approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED relatively improves the statically pre-trained code models by 12.4% for complete execution path prediction and by 25.2% for runtime variable value predictions. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.

Submitted 12 June, 2023; originally announced June 2023.

Comments: Accepted by ICSE 2024 (Early Cycle). Camera-ready is in preparation

arXiv:2306.06490 [pdf, other]

Automated Code Editing with Search-Generate-Modify

Authors: Changshu Liu, Pelin Cetin, Yogesh Patodia, Saikat Chakraborty, Yangruibo Ding, Baishakhi Ray

Abstract: Code editing is essential in evolving software development. Many automated code editing tools have been proposed that leverage both Information Retrieval-based techniques and Machine Learning-based code generation and code editing models. Each technique comes with its own promises and perils, and they are often used together to complement their strengths and compensate for their weaknesses. This paper proposes a hybrid approach to better synthesize code edits by leveraging the power of code search, generation, and modification. Our key observation is that a patch obtained by search and retrieval, even if imperfect, can provide helpful guidance to a code generation model. However, a retrieval-guided patch produced by a code generation model can still be a few tokens off from the intended patch. Such generated patches can be slightly modified to create the intended patches. SARGAM is a novel tool designed to mimic a real developer's code editing behavior. Given an original code version, the developer may search for related patches, generate or write the code, and then modify the generated code to adapt it to the right context. Our evaluation of SARGAM on edit generation shows superior performance with respect to current state-of-the-art techniques. SARGAM also shows great effectiveness on automated program repair tasks.

Submitted 26 February, 2024; v1 submitted 10 June, 2023; originally announced June 2023.

Comments: 12 pages, 10 figures

arXiv:2306.06344 [pdf, other]

Language-Guided Traffic Simulation via Scene-Level Diffusion

Authors: Ziyuan Zhong, Davis Rempe, Yuxiao Chen, Boris Ivanovic, Yulong Cao, Danfei Xu, Marco Pavone, Baishakhi Ray

Abstract: Realistic and controllable traffic simulation is a core capability that is necessary to accelerate autonomous vehicle (AV) development. However, current approaches for controlling learning-based traffic models require significant domain expertise and are difficult for practitioners to use. To remedy this, we present CTG++, a scene-level conditional diffusion model that can be guided by language instructions. Developing this requires tackling two challenges: the need for a realistic and controllable traffic model backbone, and an effective method to interface with a traffic model using language. To address these challenges, we first propose a scene-level diffusion model equipped with a spatio-temporal transformer backbone, which generates realistic and controllable traffic. We then harness a large language model (LLM) to convert a user's query into a loss function, guiding the diffusion model towards query-compliant generation. Through comprehensive evaluation, we demonstrate the effectiveness of our proposed method in generating realistic, query-compliant traffic simulations.

Submitted 18 October, 2023; v1 submitted 10 June, 2023; originally announced June 2023.

arXiv:2306.03234 [pdf, other]

CONCORD: Clone-aware Contrastive Learning for Source Code

Authors: Yangruibo Ding, Saikat Chakraborty, Luca Buratti, Saurabh Pujar, Alessandro Morari, Gail Kaiser, Baishakhi Ray

Abstract: Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. In contrast, a deviant clone by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware contrastive learning drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.

Submitted 5 June, 2023; originally announced June 2023.

Comments: Camera-ready for 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 23)

arXiv:2306.03203 [pdf, other]

A Static Evaluation of Code Completion by Large Language Models

Authors: Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang

Abstract: Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.

Submitted 5 June, 2023; originally announced June 2023.

Comments: Accepted by ACL 2023 industry track

arXiv:2304.12743 [pdf, other]

TraceFixer: Execution Trace-Driven Program Repair

Authors: Islem Bouzenia, Yangruibo Ding, Kexin Pei, Baishakhi Ray, Michael Pradel

Abstract: When debugging unintended program behavior, developers can often identify the point in the execution where the actual behavior diverges from the desired behavior. For example, a variable may get assigned a wrong value, which then negatively influences the remaining computation. Once a developer identifies such a divergence, how to fix the code so that it provides the desired behavior? This paper presents TraceFixer, a technique for predicting how to edit source code so that it does not diverge from the expected behavior anymore. The key idea is to train a neural program repair model that not only learns from source code edits but also exploits excerpts of runtime traces. The input to the model is a partial execution trace of the incorrect code, which can be obtained automatically through code instrumentation, and the correct state that the program should reach at the divergence point, which the user provides, e.g., in an interactive debugger. Our approach fundamentally differs from current program repair techniques, which share a similar goal but exploit neither execution traces nor information about the desired program state. We evaluate TraceFixer on single-line mistakes in Python code. After training the model on hundreds of thousands of code edits created by a neural model that mimics real-world bugs, we find that exploiting execution traces improves the bug-fixing ability by 13% to 20% (depending on the dataset, within the top-10 predictions) compared to a baseline that learns from source code edits only. Applying TraceFixer to 20 real-world Python bugs shows that the approach successfully fixes 10 of them.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2303.16161 [pdf, other]

doi 10.1021/acs.nanolett.3c01177

Interplay of trapped species and absence of electron capture in Moiré heterobilayers

Authors: Arnab Barman Ray, Arunabh Mukherjee, Liangyu Qiu, Renee Sailus, Sefaattin Tongay, Anthony Nickolas Vamivakas

Abstract: Moiré heterobilayers host interlayer excitons in a natural, periodic array of trapping potentials. Recent work has elucidated the structure of the trapped interlayer excitons and the nature of photoluminescence (PL) from trapped and itinerant charged complexes such as interlayer trions in these structures. In this paper, our results serve to add to the understanding of the nature of PL emission and explain its characteristic blueshift with increasing carrier density, along with demonstrating a significant difference between the interlayer exciton-trion conversion efficiency as compared to both localized and itinerant intra-layer species in conventional monolayers. Our results show the absence of optical generation of trions in these materials, which we suggest arises from the highly localized, near sub-nm confinement of trapped species in these Moiré potentials.

Submitted 28 March, 2023; originally announced March 2023.

Comments: 3 figures, Supplementary information available on request

arXiv:2303.07615 [pdf, other]

Variation of Gender Biases in Visual Recognition Models Before and After Finetuning

Authors: Jaspreet Ranjit, Tianlu Wang, Baishakhi Ray, Vicente Ordonez

Abstract: We introduce a framework to measure how biases change before and after fine-tuning a large scale visual recognition model for a downstream task. Deep learning models trained on increasing amounts of data are known to encode societal biases. Many computer vision systems today rely on models typically pretrained on large scale datasets. While bias mitigation techniques have been developed for tuning models for downstream tasks, it is currently unclear what are the effects of biases already encoded in a pretrained model. Our framework incorporates sets of canonical images representing individual and pairs of concepts to highlight changes in biases for an array of off-the-shelf pretrained models across model sizes, dataset sizes, and training objectives. Through our analyses, we find that (1) supervised models trained on datasets such as ImageNet-21k are more likely to retain their pretraining biases regardless of the target dataset compared to self-supervised models. We also find that (2) models finetuned on larger scale datasets are more likely to introduce new biased associations. Our results also suggest that (3) biases can transfer to finetuned models and the finetuning objective and dataset can impact the extent of transferred biases.

Submitted 13 March, 2023; originally announced March 2023.

Comments: 10 pages, 3 Figures

arXiv:2303.05378 [pdf, other]

Greener yet Powerful: Taming Large Code Generation Models with Quantization

Authors: Xiaokai Wei, Sujan Gonugondla, Wasi Ahmad, Shiqi Wang, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, Bing Xiang

Abstract: ML-powered code generation aims to assist developers to write code in a more productive manner, by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have substantially pushed the boundary of code generation and achieved impressive performance. Despite their great power, the huge number of model parameters poses a significant threat to adapting them in a regular software development environment, where a developer might use a standard laptop or mid-size server to develop her code. Such large models incur significant resource usage (in terms of memory, latency, and dollars) as well as carbon footprint. Model compression is a promising approach to address these challenges. Several techniques are proposed to compress large pretrained models typically used for vision or textual data. Out of many available compression techniques, we identified that quantization is mostly applicable for code generation task as it does not require significant retraining cost. As quantization represents model parameters with lower-bit integer (e.g., int8), the model size and runtime latency would both benefit from such int representation. We extensively study the impact of quantized model on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. To this end, through systematic experiments we find a recipe of quantization technique that could run even a 6B model in a regular laptop without significant accuracy or robustness degradation. We further found the recipe is readily applicable to code summarization task as well.

Submitted 9 March, 2023; originally announced March 2023.

Comments: 10 pages, 7 figures, 10 tables

arXiv:2302.10812 [pdf, other]

On ML-Based Program Translation: Perils and Promises

Authors: Aniketh Malyala, Katelyn Zhou, Baishakhi Ray, Saikat Chakraborty

Abstract: With the advent of new and advanced programming languages, it becomes imperative to migrate legacy software to new programming languages. Unsupervised Machine Learning-based Program Translation could play an essential role in such migration, even without a sufficiently sizeable reliable corpus of parallel source code. However, these translators are far from perfect due to their statistical nature. This work investigates unsupervised program translators and where and why they fail. With in-depth error analysis of such failures, we have identified that the cases where such translators fail follow a few particular patterns. With this insight, we develop a rule-based program mutation engine, which pre-processes the input code if the input follows specific patterns and post-processes the output if the output follows certain patterns. We show that our code processing tool, in conjunction with the program translator, can form a hybrid program translator and significantly improve the state-of-the-art. In the future, we envision an end-to-end program translation tool where programming domain knowledge can be embedded into an ML-based translation pipeline using pre- and post-processing steps.

Submitted 21 February, 2023; originally announced February 2023.

Comments: 5 pages, 2 figures. Accepted at ICSE 2023 NIER - New Ideas and Emerging Results

arXiv:2212.10264 [pdf, other]

ReCode: Robustness Evaluation of Code Generation Models

Authors: Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang

Abstract: Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model's robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.

Submitted 20 December, 2022; originally announced December 2022.

Comments: Code and data available at https://github.com/amazon-science/recode

arXiv:2210.17366 [pdf, other]

Guided Conditional Diffusion for Controllable Traffic Simulation

Authors: Ziyuan Zhong, Davis Rempe, Danfei Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray, Marco Pavone

Abstract: Controllable and realistic traffic simulation is critical for developing and verifying autonomous vehicles. Typical heuristic-based traffic models offer flexible control to make vehicles follow specific trajectories and traffic rules. On the other hand, data-driven approaches generate realistic and human-like behaviors, improving transfer from simulated to real-world traffic. However, to the best of our knowledge, no traffic model offers both controllability and realism. In this work, we develop a conditional diffusion model for controllable traffic generation (CTG) that allows users to control desired properties of trajectories at test time (e.g., reach a goal or follow a speed limit) while maintaining realism and physical feasibility through enforced dynamics. The key technical idea is to leverage recent advances from diffusion modeling and differentiable logic to guide generated trajectories to meet rules defined using signal temporal logic (STL). We further extend guidance to multi-agent settings and enable interaction-based rules like collision avoidance. CTG is extensively evaluated on the nuScenes dataset for diverse and composite rules, demonstrating improvement over strong baselines in terms of the controllability-realism tradeoff.

Submitted 31 October, 2022; originally announced October 2022.

arXiv:2210.14868 [pdf, other]

Multi-lingual Evaluation of Code Generation Models

Authors: Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang

Abstract: We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and discovered generalization ability of language models on out-of-domain languages, advantages of multi-lingual models over mono-lingual, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities even on mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.

Submitted 28 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: Code and data release: https://github.com/amazon-research/mxeval

arXiv:2210.14250 [pdf, other]

Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature

Authors: Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Bill Ray, Moira Inghilleri, John Wieting, Mohit Iyyer

Abstract: Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than more traditional MT settings since translators must balance meaning equivalence, readability, and critical interpretability in the target language. This property, along with the complex discourse-level context present in literary texts, also makes literary MT more challenging to computationally model and evaluate. To explore this task, we collect a dataset (Par3) of non-English language novels in the public domain, each aligned at the paragraph level to both human and automatic English translations. Using Par3, we discover that expert literary translators prefer reference human translations over machine-translated paragraphs at a rate of 84%, while state-of-the-art automatic MT metrics do not correlate with those preferences. The experts note that MT outputs contain not only mistranslations, but also discourse-disrupting errors and stylistic inconsistencies. To address these problems, we train a post-editing model whose output is preferred over normal MT output at a rate of 69% by experts. We publicly release Par3 at https://github.com/katherinethai/par3/ to spur future research into literary MT.

Submitted 25 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2210.02853 [pdf, other]

doi 10.1145/3540250.3549147

NeuDep: Neural Binary Memory Dependence Analysis

Authors: Kexin Pei, Dongdong She, Michael Wang, Scott Geng, Zhou Xuan, Yaniv David, Junfeng Yang, Suman Jana, Baishakhi Ray

Abstract: Determining whether multiple instructions can access the same memory location is a critical task in binary analysis. It is challenging as statically computing precise alias information is undecidable in theory. The problem aggravates at the binary level due to the presence of compiler optimizations and the absence of symbols and types. Existing approaches either produce significant spurious dependencies due to conservative analysis or scale poorly to complex binaries. We present a new machine-learning-based approach to predict memory dependencies by exploiting the model's learned knowledge about how binary programs execute. Our approach features (i) a self-supervised procedure that pretrains a neural net to reason over binary code and its dynamic value flows through memory addresses, followed by (ii) supervised finetuning to infer the memory dependencies statically. To facilitate efficient learning, we develop dedicated neural architectures to encode the heterogeneous inputs (i.e., code, data values, and memory addresses from traces) with specific modules and fuse them with a composition learning strategy. We implement our approach in NeuDep and evaluate it on 41 popular software projects compiled by 2 compilers, 4 optimizations, and 4 obfuscation passes. We demonstrate that NeuDep is more precise (1.5x) and faster (3.5x) than the current state-of-the-art. Extensive probing studies on security-critical reverse engineering tasks suggest that NeuDep understands memory access patterns, learns function signatures, and is able to match indirect calls. All these tasks either assist or benefit from inferring memory dependencies. Notably, NeuDep also outperforms the current state-of-the-art on these tasks.

Submitted 4 October, 2022; originally announced October 2022.

Comments: ESEC/FSE 2022

arXiv:2210.01185 [pdf, other]

ContraCLM: Contrastive Learning For Causal Language Model

Authors: Nihal Jain, Dejiao Zhang, Wasi Uddin Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang

Abstract: Despite exciting progress in causal language models, the expressiveness of the representations is largely limited due to poor discrimination ability. To remedy this issue, we present ContraCLM, a novel contrastive learning framework at both token-level and sequence-level. We assess ContraCLM on a variety of downstream tasks. We show that ContraCLM enhances discrimination of the representations and bridges the gap with the encoder-only models, which makes causal language models better suited for tasks beyond language generation. Specifically, we attain 44% relative improvement on the Semantic Textual Similarity tasks and 34% on Code-to-Code Search tasks. Furthermore, by improving the expressiveness of the representations, ContraCLM also boosts the source code generation capability with 9% relative improvement on execution accuracy on the HumanEval benchmark.

Submitted 2 May, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: 10 pages

Journal ref: ACL 2023

arXiv:2209.14921 [pdf]

IvySyn: Automated Vulnerability Discovery in Deep Learning Frameworks

Authors: Neophytos Christou, Di Jin, Vaggelis Atlidakis, Baishakhi Ray, Vasileios P. Kemerlis

Abstract: We present IvySyn, the first fully-automated framework for discovering memory error vulnerabilities in Deep Learning (DL) frameworks. IvySyn leverages the statically-typed nature of native APIs in order to automatically perform type-aware mutation-based fuzzing on low-level kernel code. Given a set of offending inputs that trigger memory safety (and runtime) errors in low-level, native DL (C/C++) code, IvySyn automatically synthesizes code snippets in high-level languages (e.g., in Python), which propagate error-triggering input via high(er)-level APIs. Such code snippets essentially act as "Proof of Vulnerability", as they demonstrate the existence of bugs in native code that an attacker can target through various high-level APIs. Our evaluation shows that IvySyn significantly outperforms past approaches, both in terms of efficiency and effectiveness, in finding vulnerabilities in popular DL frameworks. Specifically, we used IvySyn to test TensorFlow and PyTorch. Although still an early prototype, IvySyn has already helped the TensorFlow and PyTorch framework developers to identify and fix 61 previously-unknown security vulnerabilities, and assign 39 unique CVEs.

Submitted 27 April, 2023; v1 submitted 29 September, 2022; originally announced September 2022.

Comments: Accepted at USENIX Security 2023

arXiv:2207.11784 [pdf, other]

CARGO: AI-Guided Dependency Analysis for Migrating Monolithic Applications to Microservices Architecture

Authors: Vikram Nitin, Shubhi Asthana, Baishakhi Ray, Rahul Krishna

Abstract: Microservices Architecture (MSA) has become a de-facto standard for designing cloud-native enterprise applications due to its efficient infrastructure setup, service availability, elastic scalability, dependability, and better security. Existing (monolithic) systems must be decomposed into microservices to harness these characteristics. Since manual decomposition of large scale applications can be laborious and error-prone, AI-based systems to detect decomposition strategies are gaining popularity. However, the usefulness of these approaches is limited by the expressiveness of the program representation and their inability to model the application's dependency on critical external resources such as databases. Consequently, partitioning recommendations offered by current tools result in architectures that (a) are distributed monoliths, and/or (b) force the use of (often criticized) distributed transactions. This work attempts to overcome these challenges by introducing CARGO (short for [C]ontext-sensitive l[A]bel p[R]opa[G]ati[O]n), a novel un-/semi-supervised partition refinement technique that uses a context- and flow-sensitive system dependency graph of the monolithic application to refine and thereby enrich the partitioning quality of the current state-of-the-art algorithms. CARGO was used to augment four state-of-the-art microservice partitioning techniques that were applied on five Java EE applications (including one industrial scale proprietary project). Experiments demonstrate that CARGO can improve the partition quality of all modern microservice partitioning techniques. Further, CARGO substantially reduces distributed transactions, and a real-world performance evaluation of a benchmark application (deployed under varying loads) shows that CARGO also lowers the overall latency of the deployed microservice application by 11% and increases throughput by 120% on average.

Submitted 6 October, 2022; v1 submitted 24 July, 2022; originally announced July 2022.

Comments: ACM Distinguished Paper ASE '22, October 10-14, 2022, Ann Arbor, MI, USA

ACM Class: D.2.11

arXiv:2206.09357 [pdf, other]

Automatic Map Generation for Autonomous Driving System Testing

Authors: Yun Tang, Yuan Zhou, Kairui Yang, Ziyuan Zhong, Baishakhi Ray, Yang Liu, Ping Zhang, Junbo Chen

Abstract: High-definition (HD) maps are essential in testing autonomous driving systems (ADSs). HD maps essentially determine the potential diversity of the testing scenarios. However, the current HD maps suffer from two main limitations: lack of junction diversity in the publicly available HD maps and the high cost of building a new HD map. Hence, in this paper, we propose FEAT2MAP to automatically generate concise HD maps with scenario diversity guarantees. FEAT2MAP focuses on junctions as they significantly influence scenario diversity, especially in urban road networks. FEAT2MAP first defines a set of features to characterize junctions. Then, FEAT2MAP extracts and samples concrete junction features from a list of input HD maps or user-defined requirements. Each junction feature generates a junction. Finally, FEAT2MAP builds a map by connecting the junctions in a grid layout. To demonstrate the effectiveness of FEAT2MAP, we conduct experiments with the public HD maps from SVL and the open-source ADS Apollo. The results show that FEAT2MAP can (1) generate new maps of reduced size while maintaining scenario diversity in terms of the code coverage and motion states of the ADS under test, and (2) generate new maps of increased scenario diversity by merging intersection features from multiple maps or taking user inputs.

Submitted 19 June, 2022; originally announced June 2022.

Comments: 7 pages, 7 figures

arXiv:2206.07585 [pdf, other]

NatGen: Generative pre-training by "Naturalizing" source code

Authors: Saikat Chakraborty, Toufique Ahmed, Yangruibo Ding, Premkumar Devanbu, Baishakhi Ray

Abstract: Pre-trained Generative Language models (e.g. PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training objectives to learn statistics of code construction from very large-scale corpora in a self-supervised fashion; the success of pre-trained models largely hinges on these pre-training objectives. This paper proposes a new pre-training objective, "Naturalizing" of source code, exploiting code's bimodal, dual-channel (formal & natural channels) nature. Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale. We introduce six classes of semantic preserving transformations to introduce un-natural forms of code, and then force our model to produce more natural original programs written by developers. Learning to generate equivalent, but more natural code, at scale, over large corpora of open-source code, without explicit manual supervision, helps the model learn to both ingest & generate code. We fine-tune our model in three generative Software Engineering tasks: code generation, code translation, and code refinement with limited human-curated labeled data and achieve state-of-the-art performance rivaling CodeT5. We show that our pre-trained model is especially competitive at zero-shot and few-shot learning, and better at learning code properties (e.g., syntax, data flow).

Submitted 5 July, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

Comments: Accepted to be published in ESEC/FSE 2022

arXiv:2205.11116 [pdf, other]

Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages

Authors: Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Abstract: Back-translation is widely known for its effectiveness in neural machine translation when there is little to no parallel data. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The target-to-source model generates noisy sources, while the source-to-target model is trained to reconstruct the targets and vice versa. Recent developments of multilingual pre-trained sequence-to-sequence models for programming languages have been very effective for a broad spectrum of downstream software engineering tasks. Hence, training them to build programming language translation systems via back-translation is compelling. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate natural language (NL) summaries given code snippets. In code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as a target-to-NL-to-source generation. We show that our proposed approach performs competitively with state-of-the-art methods. We have made the code publicly available.

Submitted 11 February, 2023; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Accepted to EACL 2023 (Main)

arXiv:2203.13612 [pdf, other]

Repairing Group-Level Errors for DNNs Using Weighted Regularization

Authors: Ziyuan Zhong, Yuchi Tian, Conor J. Sweeney, Vicente Ordonez, Baishakhi Ray

Abstract: Deep Neural Networks (DNNs) have been widely used in software making decisions impacting people's lives. However, they have been found to exhibit severe erroneous behaviors that may lead to unfortunate outcomes. Previous work shows that such misbehaviors often occur due to class property violations rather than errors on a single image. Although methods for detecting such errors have been proposed, fixing them has not been studied so far. Here, we propose a generic method called Weighted Regularization (WR) consisting of five concrete methods targeting the error-producing classes to fix the DNNs. In particular, it can repair confusion errors and bias errors of DNN models for both single-label and multi-label image classification. A confusion error happens when a given DNN model tends to confuse two classes. Each method in WR assigns more weight at a stage of DNN retraining or inference to mitigate the confusion between the target pair. A bias error can be fixed similarly. We evaluate and compare the proposed methods along with baselines on six widely used dataset and architecture combinations. The results suggest that WR methods have different trade-offs, but under each setting at least one WR method can greatly reduce confusion/bias errors at a very limited cost to overall performance.

Submitted 4 April, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

arXiv:2203.11320 [pdf, other]

Valley engineering electron-hole liquids in TMDC monolayers

Authors: Arnab Barman Ray, Kevin Liang, Nick Vamivakas

Abstract: Electron-hole liquids (EHLs), a correlated state of matter and a thermodynamic liquid, have recently been found to exist at room temperature in suspended monolayers of MoS2. Appreciably higher rates of radiative recombination inside the liquid as compared to free excitons hold promise for optoelectronic applications such as broadband lasing. In this paper, we show that leveraging the valley physics in MoS2 may be a route towards achieving tunability of specific characteristics of an EHL, such as emission wavelength, linewidth, and most importantly, the liquid density. The conditions under which EHLs form, in bulk semiconductors as well as TMDC monolayers, are quite stringent, requiring high crystal purity and cryogenic temperatures in bulk semiconductors, and suspension in monolayers. Using a simple yet powerful model for describing free excitons, we show that a phase transition into the EHL state may be feasible in substrate-supported monolayer samples. More repeatable experimental realizations of EHLs may be essential to answer questions regarding the nature of electron-hole correlations and how they may be used to generate non-trivial states of light.

Submitted 21 March, 2022; originally announced March 2022.

Comments: 15 pages, 5 figures, unpublished

arXiv:2201.08413 [pdf, other]

Unicorn: Reasoning about Configurable System Performance through the lens of Causality

Authors: Md Shahriar Iqbal, Rahul Krishna, Mohammad Ali Javidian, Baishakhi Ray, Pooyan Jamshidi

Abstract: Modern computer systems are highly configurable, with the total variability space sometimes larger than the number of atoms in the universe. Understanding and reasoning about the performance behavior of highly configurable systems, over a vast and variable space, is challenging. State-of-the-art methods for performance modeling and analyses rely on predictive machine learning models; therefore, they (i) become unreliable in unseen environments (e.g., different hardware, workloads), and (ii) may produce incorrect explanations. To tackle this, we propose a new method, called Unicorn, which (i) captures intricate interactions between configuration options across the software-hardware stack and (ii) describes how such interactions can impact performance variations via causal inference. We evaluated Unicorn on six highly configurable systems, including three on-device machine learning systems, a video encoder, a database management system, and a data analytics pipeline. The experimental results indicate that Unicorn outperforms state-of-the-art performance debugging and optimization methods in finding effective repairs for performance faults and finding configurations with near-optimal performance. Further, unlike the existing methods, the learned causal performance models reliably predict performance for new environments.

Submitted 17 March, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

Comments: EuroSys 2022 (camera-ready)

arXiv:2201.03758 [pdf, other]

Predictive Synthesis of API-Centric Code

Authors: Daye Nam, Baishakhi Ray, Seohyun Kim, Xianshan Qu, Satish Chandra

Abstract: Today's programmers, especially data science practitioners, make heavy use of data-processing libraries (APIs) such as PyTorch, TensorFlow, NumPy, Pandas, and the like. Program synthesizers can provide significant coding assistance to this community of users; however, program synthesis can also be slow due to enormous search spaces. In this work, we examine ways in which machine learning can be used to accelerate enumerative program synthesis. We present a deep-learning-based model to predict the sequence of API functions that would be needed to go from a given input to a desired output, both being numeric vectors. Our work is based on two insights. First, it is possible to learn, based on a large number of input-output examples, to predict the likely API function needed in a given situation. Second, and crucially, it is also possible to learn to compose API functions into a sequence, given an input and the desired final output, without explicitly knowing the intermediate values. We show that we can speed up an enumerative program synthesizer by using predictions from our model variants. These speedups significantly outperform previous ways (e.g., DeepCoder) in which researchers have used ML models in enumerative synthesis.

Submitted 17 May, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

Showing 1–50 of 111 results for author: Ray, B