Search | arXiv e-print repository

Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming

Authors: Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, Michele Tufano

Abstract: The integration of Large Language Models (LLMs) into Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, utilizing LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state of the art evaluation systems. We design and compute both static and execution based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug-fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM guided IDEs.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2307.13383 [pdf, other]

Predicting Code Coverage without Execution

Authors: Michele Tufano, Shubham Chandel, Anisha Agarwal, Neel Sundaresan, Colin Clement

Abstract: Code coverage is a widely used metric for quantifying the extent to which program elements, such as statements or branches, are executed during testing. Calculating code coverage is resource-intensive, requiring code building and execution with additional overhead for the instrumentation. Furthermore, computing coverage of any snippet of code requires the whole program context. Using Machine Learning to amortize this expensive process could lower the cost of code coverage by requiring only the source code context, and the task of code coverage prediction can be a novel benchmark for judging the ability of models to understand code. We propose a novel benchmark task called Code Coverage Prediction for Large Language Models (LLMs). We formalize this task to evaluate the capability of LLMs in understanding code execution by determining which lines of a method are executed by a given test case and inputs. We curate and release a dataset we call COVERAGEEVAL by executing tests and code from the HumanEval dataset and collecting code coverage information. We report the performance of four state-of-the-art LLMs used for code-related tasks, including OpenAI's GPT-4 and GPT-3.5-Turbo, Google's BARD, and Anthropic's Claude, on the Code Coverage Prediction task. Finally, we argue that code coverage as a metric and pre-training data source are valuable for overall LLM performance on software engineering tasks.

Submitted 25 July, 2023; originally announced July 2023.
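
The abstract above frames coverage as "which lines of a method are executed by a given test case and inputs". A minimal sketch of how such line-level labels could be collected by execution is shown here, using Python's standard `sys.settrace` hook. This is not the COVERAGEEVAL pipeline; the helper `executed_lines` and the toy function `sign` are hypothetical names introduced only for illustration.

```python
import sys


def executed_lines(func, *args, **kwargs):
    """Run func and return the executed line offsets, relative to its `def` line.

    Hypothetical helper: only lines in func's own body are recorded,
    not lines in functions that func calls.
    """
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Trace only the frame that belongs to the target function.
        if frame.f_code is code:
            if event == "line":
                hit.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        # Restore whatever tracer (debugger, coverage tool) was active before.
        sys.settrace(previous)
    return hit


def sign(x):
    if x < 0:
        return -1
    return 1


if __name__ == "__main__":
    # Different inputs exercise different branches of the same method.
    print(sorted(executed_lines(sign, 5)))
    print(sorted(executed_lines(sign, -5)))
```

A prediction model in the sense of the abstract would be asked to produce the same per-line executed/not-executed labels from the source and test input alone, without running the code.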

arXiv:2304.00913 [pdf, other]

LAHM : Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification

Authors: Ankit Yadav, Shubham Chandel, Sushant Chatufale, Anil Bandhakavi

Abstract: Current research on hate speech analysis is typically oriented towards monolingual and single classification tasks. In this paper, we present a new multilingual hate speech analysis dataset for English, Hindi, Arabic, French, German and Spanish languages for multiple domains across hate speech - Abuse, Racism, Sexism, Religious Hate and Extremism. To the best of our knowledge, this paper is the first to address the problem of identifying various types of hate speech in these five wide domains in these six languages. In this work, we describe how we created the dataset, created annotations at high level and low level for different domains and how we use it to test the current state-of-the-art multilingual and multitask learning approaches. We evaluate our dataset in various monolingual, cross-lingual and machine translation classification settings and compare it against open source English datasets that we aggregated and merged for this task. Then we discuss how this approach can be used to create large scale hate-speech datasets and how to leverage our annotations in order to improve hate speech detection and classification in general.

Submitted 3 April, 2023; originally announced April 2023.

arXiv:2201.12901 [pdf, other]

Training and Evaluating a Jupyter Notebook Data Science Assistant

Authors: Shubham Chandel, Colin B. Clement, Guillermo Serrato, Neel Sundaresan

Abstract: We study the feasibility of a Data Science assistant powered by a sequence-to-sequence transformer by training a new model JuPyT5 on all publicly available Jupyter Notebook GitHub repositories and developing a new metric: Data Science Problems (DSP). DSP is a collection of 1119 problems curated from 306 pedagogical notebooks with 92 dataset dependencies, natural language and Markdown problem descriptions, and assert-based unit tests. These notebooks were designed to test university students' mastery of various Python implementations of Math and Data Science, and we now leverage them to study the ability of JuPyT5 to understand and pass the tests. We analyze the content of DSP, validate its quality, and we find that given 100 sampling attempts JuPyT5 is able to solve 77.5% of the DSP problems. We further present various ablation and statistical analyses and compare DSP to other recent natural language to code benchmarks.

Submitted 30 January, 2022; originally announced January 2022.

arXiv:1407.3987 [pdf]

Routing Attacks in Wireless Sensor Networks: A Survey

Authors: Deepali Virmani, Ankita Soni, Shringarica Chandel, Manas Hemrajani

Abstract: Wireless Sensor Networks (WSNs) are an emerging technology with a wide range of applications such as battlefield surveillance, traffic surveillance, forest fire detection, and flood detection. However, wireless sensor networks are susceptible to a variety of potential attacks that obstruct the normal operation of the network. The security of a wireless sensor network is compromised by the random deployment of sensor nodes in open environments, memory limitations, power limitations, and their unattended nature. This paper focuses on various attacks that manifest in the network and provides a tabular representation of the attacks, their effects, and severity. The paper compares attacks on the basis of packet loss and packet corruption. It also discusses the known defence mechanisms and countermeasures against the attacks.

Submitted 29 May, 2014; originally announced July 2014.

Comments: IJCSIT April 2014

arXiv:1401.2541 [pdf]

Exponential Trust Based Mechanism to Detect Black Hole attack in Wireless Sensor Network

Authors: Dr. Deepali Virmani, Manas Hemrajani, Shringarica Chandel

Abstract: Security is a key requirement in Wireless Sensor Networks, but they are prone to many kinds of attacks, one of which is the black hole attack. In a black hole attack, all packets are dropped consecutively, which reduces the efficiency of the network and wastes battery life. In this paper, we propose an exponential trust based mechanism to detect the malicious node. In the proposed method, a streak counter stores the number of consecutive packets dropped, and a trust factor is maintained for each node. The trust factor drops exponentially with each consecutive packet dropped, which helps in detecting the malicious node. The proposed method shows a drastic decrease in the number of packets dropped before the node is detected as malicious.

Submitted 11 January, 2014; originally announced January 2014.

Comments: 5 pages, 2 figures
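
The abstract above describes a streak counter of consecutive drops and a per-node trust factor that falls exponentially with each consecutive drop. A minimal sketch of that idea follows. The abstract does not give the paper's actual formula, so the decay base, the detection threshold, the reset-on-forward rule, and the class name `NodeTrust` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeTrust:
    """Per-node trust state (hypothetical sketch, not the paper's exact scheme)."""

    decay: float = 0.5       # assumed base of the exponential drop
    threshold: float = 0.1   # assumed trust level below which a node is flagged
    streak: int = 0          # consecutive packets dropped so far
    trust: float = 1.0       # assumed initial trust

    def observe(self, forwarded: bool) -> None:
        """Update the streak counter and trust factor after one packet."""
        if forwarded:
            # Assumption: a forwarded packet ends the streak and restores trust.
            self.streak = 0
            self.trust = 1.0
        else:
            self.streak += 1
            # Trust shrinks exponentially in the length of the drop streak.
            self.trust = self.decay ** self.streak

    def is_malicious(self) -> bool:
        return self.trust < self.threshold


if __name__ == "__main__":
    node = NodeTrust()
    for forwarded in [True, False, False, False, False]:
        node.observe(forwarded)
        print(node.streak, node.trust, node.is_malicious())
```

Because trust depends on the streak exponentially rather than linearly, a node that drops every packet crosses the threshold after only a few packets, which matches the abstract's claim of fewer packets lost before detection.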

Showing 1–6 of 6 results for author: Chandel, S