Search | arXiv e-print repository

arXiv:2408.14589 [pdf, other]

Wandercode: An Interaction Design for Code Recommenders to Reduce Information Overload, Ease Exploration, and Save Screen Space

Authors: Austin Z. Henley, David Shepherd, Scott D. Fleming

Abstract: In this paper, we present Wandercode, a novel interaction design for recommender systems that recommend code locations to aid programmers in software development tasks. In particular, our design aims to improve upon prior designs by reducing information overload, by better supporting the exploration of recommendations, and by making more efficient use of screen space. During our design process, we… ▽ More In this paper, we present Wandercode, a novel interaction design for recommender systems that recommend code locations to aid programmers in software development tasks. In particular, our design aims to improve upon prior designs by reducing information overload, by better supporting the exploration of recommendations, and by making more efficient use of screen space. During our design process, we developed a set of design dimensions to aid others in the design of code recommenders. To validate our design, we implemented a prototype of our design as an Atom code editor extension with support for the Java programming language, and conducted an empirical user evaluation comparing our graph-based Wandercode design to a control design representative of prior list-based interaction designs for code recommenders. The results showed that, compared with the control design, Wandercode helped participants complete tasks more quickly, reduced their cognitive load, and was viewed more favorably by participants. △ Less

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2407.02651 [pdf, other]

doi 10.1145/3654777.3676345

Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition

Authors: Majeed Kazemitabaar, Jack Williams, Ian Drosos, Tovi Grossman, Austin Henley, Carina Negreanu, Advait Sarkar

Abstract: LLM-powered tools like ChatGPT Data Analysis, have the potential to help users tackle the challenging task of data analysis programming, which requires expertise in data processing, programming, and statistics. However, our formative study (n=15) uncovered serious challenges in verifying AI-generated results and steering the AI (i.e., guiding the AI system to produce the desired output). We develo… ▽ More LLM-powered tools like ChatGPT Data Analysis, have the potential to help users tackle the challenging task of data analysis programming, which requires expertise in data processing, programming, and statistics. However, our formative study (n=15) uncovered serious challenges in verifying AI-generated results and steering the AI (i.e., guiding the AI system to produce the desired output). We developed two contrasting approaches to address these challenges. The first (Stepwise) decomposes the problem into step-by-step subgoals with pairs of editable assumptions and code until task completion, while the second (Phasewise) decomposes the entire problem into three editable, logical phases: structured input/output assumptions, execution plan, and code. A controlled, within-subjects experiment (n=18) compared these systems against a conversational baseline. Users reported significantly greater control with the Stepwise and Phasewise systems, and found intervention, correction, and verification easier, compared to the baseline. The results suggest design guidelines and trade-offs for AI-assisted data analysis tools. △ Less

Submitted 1 August, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

Comments: Published at UIST 2024; 19 pages, 9 figures, and 2 tables

Journal ref: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST 2024)

arXiv:2405.01556 [pdf, other]

Semantically Aligned Question and Code Generation for Automated Insight Generation

Authors: Ananya Singha, Bhavya Chopra, Anirudh Khatry, Sumit Gulwani, Austin Z. Henley, Vu Le, Chris Parnin, Mukul Singh, Gust Verbruggen

Abstract: Automated insight generation is a common tactic for helping knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, automated insights produced by large-language models can generate code that does not correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to… ▽ More Automated insight generation is a common tactic for helping knowledge workers, such as data scientists, to quickly understand the potential value of new and unfamiliar data. Unfortunately, automated insights produced by large-language models can generate code that does not correctly correspond (or align) to the insight. In this paper, we leverage the semantic knowledge of large language models to generate targeted and insightful questions about data and the corresponding code to answer those questions. Then through an empirical study on data from Open-WikiTable, we show that embeddings can be effectively used for filtering out semantically unaligned pairs of question and code. Additionally, we found that generating questions and code together yields more diverse questions. △ Less

Submitted 21 March, 2024; originally announced May 2024.

arXiv:2403.07762 [pdf, other]

Supporting Annotators with Affordances for Efficiently Labeling Conversational Data

Authors: Austin Z. Henley, David Piorkowski

Abstract: Without well-labeled ground truth data, machine learning-based systems would not be as ubiquitous as they are today, but these systems rely on substantial amounts of correctly labeled data. Unfortunately, crowdsourced labeling is time consuming and expensive. To address the concerns of effort and tedium, we designed CAL, a novel interface to aid in data labeling. We made several key design decisio… ▽ More Without well-labeled ground truth data, machine learning-based systems would not be as ubiquitous as they are today, but these systems rely on substantial amounts of correctly labeled data. Unfortunately, crowdsourced labeling is time consuming and expensive. To address the concerns of effort and tedium, we designed CAL, a novel interface to aid in data labeling. We made several key design decisions for CAL, which include preventing inapt labels from being selected, guiding users in selecting an appropriate label when they need assistance, incorporating labeling documentation into the interface, and providing an efficient means to view previous labels. We implemented a production-quality implementation of CAL and report a user-study evaluation that compares CAL to a standard spreadsheet. Key findings of our study include users using CAL reported lower cognitive load, did not increase task time, users rated CAL to be easier to use, and users preferred CAL over the spreadsheet. △ Less

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2401.11314 [pdf, other]

doi 10.1145/3613904.3642773

CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs

Authors: Majeed Kazemitabaar, Runlong Ye, Xiaoning Wang, Austin Z. Henley, Paul Denny, Michelle Craig, Tovi Grossman

Abstract: Timely, personalized feedback is essential for students learning programming. LLM-powered tools like ChatGPT offer instant support, but reveal direct answers with code, which may hinder deep conceptual engagement. We developed CodeAid, an LLM-powered programming assistant delivering helpful, technically correct responses, without revealing code solutions. CodeAid answers conceptual questions, gene… ▽ More Timely, personalized feedback is essential for students learning programming. LLM-powered tools like ChatGPT offer instant support, but reveal direct answers with code, which may hinder deep conceptual engagement. We developed CodeAid, an LLM-powered programming assistant delivering helpful, technically correct responses, without revealing code solutions. CodeAid answers conceptual questions, generates pseudo-code with line-by-line explanations, and annotates student's incorrect code with fix suggestions. We deployed CodeAid in a programming class of 700 students for a 12-week semester. A thematic analysis of 8,000 usages of CodeAid was performed, further enriched by weekly surveys, and 22 student interviews. We then interviewed eight programming educators to gain further insights. Our findings reveal four design considerations for future educational AI assistants: D1) exploiting AI's unique benefits; D2) simplifying query formulation while promoting cognitive engagement; D3) avoiding direct responses while encouraging motivated learning; and D4) maintaining transparency and control for students to asses and steer AI responses. △ Less

Submitted 25 February, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Comments: CHI 2024 Paper - The paper includes 17 pages, 8 figures, 2 tables, along with a 2-page appendix

arXiv:2312.14231 [pdf, other]

Building Your Own Product Copilot: Challenges, Opportunities, and Needs

Authors: Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, Austin Z. Henley

Abstract: A race is underway to embed advanced AI capabilities into products. These product copilots enable users to ask questions in natural language and receive relevant responses that are specific to the user's context. In fact, virtually every large technology company is looking to add these capabilities to their software products. However, for most software engineers, this is often their first encounte… ▽ More A race is underway to embed advanced AI capabilities into products. These product copilots enable users to ask questions in natural language and receive relevant responses that are specific to the user's context. In fact, virtually every large technology company is looking to add these capabilities to their software products. However, for most software engineers, this is often their first encounter with integrating AI-powered technology. Furthermore, software engineering processes and tools have not caught up with the challenges and scale involved with building AI-powered applications. In this work, we present the findings of an interview study with 26 professional software engineers responsible for building product copilots at various companies. From our interviews, we found pain points at every step of the engineering process and the challenges that strained existing development practices. We then conducted group brainstorming sessions to collaborative on opportunities and tool designs for the broader software engineering community. △ Less

Submitted 21 December, 2023; originally announced December 2023.

Comments: 11 pages

arXiv:2310.16164 [pdf, other]

Conversational Challenges in AI-Powered Data Science: Obstacles, Needs, and Design Opportunities

Authors: Bhavya Chopra, Ananya Singha, Anna Fariha, Sumit Gulwani, Chris Parnin, Ashish Tiwari, Austin Z. Henley

Abstract: Large Language Models (LLMs) are being increasingly employed in data science for tasks like data preprocessing and analytics. However, data scientists encounter substantial obstacles when conversing with LLM-powered chatbots and acting on their suggestions and answers. We conducted a mixed-methods study, including contextual observations, semi-structured interviews (n=14), and a survey (n=114), to… ▽ More Large Language Models (LLMs) are being increasingly employed in data science for tasks like data preprocessing and analytics. However, data scientists encounter substantial obstacles when conversing with LLM-powered chatbots and acting on their suggestions and answers. We conducted a mixed-methods study, including contextual observations, semi-structured interviews (n=14), and a survey (n=114), to identify these challenges. Our findings highlight key issues faced by data scientists, including contextual data retrieval, formulating prompts for complex tasks, adapting generated code to local environments, and refining prompts iteratively. Based on these insights, we propose actionable design recommendations, such as data brushing to support context selection, and inquisitive feedback loops to improve communications with AI-based assistants in data-science tools. △ Less

Submitted 24 October, 2023; originally announced October 2023.

Comments: 24 pages, 8 figures

arXiv:2309.14049 [pdf, other]

How Novices Use LLM-Based Code Generators to Solve CS1 Coding Tasks in a Self-Paced Learning Environment

Authors: Majeed Kazemitabaar, Xinying Hou, Austin Henley, Barbara J. Ericson, David Weintrop, Tovi Grossman

Abstract: As Large Language Models (LLMs) gain in popularity, it is important to understand how novice programmers use them. We present a thematic analysis of 33 learners, aged 10-17, independently learning Python through 45 code-authoring tasks using Codex, an LLM-based code generator. We explore several questions related to how learners used these code generators and provide an analysis of the properties… ▽ More As Large Language Models (LLMs) gain in popularity, it is important to understand how novice programmers use them. We present a thematic analysis of 33 learners, aged 10-17, independently learning Python through 45 code-authoring tasks using Codex, an LLM-based code generator. We explore several questions related to how learners used these code generators and provide an analysis of the properties of the written prompts and the generated code. Specifically, we explore (A) the context in which learners use Codex, (B) what learners are asking from Codex, (C) properties of their prompts in terms of relation to task description, language, and clarity, and prompt crafting patterns, (D) the correctness, complexity, and accuracy of the AI-generated code, and (E) how learners utilize AI-generated code in terms of placement, verification, and manual modifications. Furthermore, our analysis reveals four distinct coding approaches when writing code with an AI code generator: AI Single Prompt, where learners prompted Codex once to generate the entire solution to a task; AI Step-by-Step, where learners divided the problem into parts and used Codex to generate each part; Hybrid, where learners wrote some of the code themselves and used Codex to generate others; and Manual coding, where learners wrote the code themselves. The AI Single Prompt approach resulted in the highest correctness scores on code-authoring tasks, but the lowest correctness scores on subsequent code-modification tasks during training. Our results provide initial insight into how novice learners use AI code generators and the challenges and opportunities associated with integrating them into self-paced learning environments. We conclude with various signs of over-reliance and self-regulation, as well as opportunities for curriculum and tool development. △ Less

Submitted 25 September, 2023; originally announced September 2023.

Comments: 12 pages, Peer-Reviewed, Accepted for publication in the proceedings of the 2023 ACM Koli Calling International Conference on Computing Education Research

arXiv:2210.05506 [pdf, other]

Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration

Authors: Matteo Paltenghi, Rahul Pandita, Austin Z. Henley, Albert Ziegler

Abstract: Recent neural models of code, such as OpenAI Codex and AlphaCode, have demonstrated remarkable proficiency at code generation due to the underlying attention mechanism. However, it often remains unclear how the models actually process code, and to what extent their reasoning and the way their attention mechanism scans the code matches the patterns of developers. A poor understanding of the model r… ▽ More Recent neural models of code, such as OpenAI Codex and AlphaCode, have demonstrated remarkable proficiency at code generation due to the underlying attention mechanism. However, it often remains unclear how the models actually process code, and to what extent their reasoning and the way their attention mechanism scans the code matches the patterns of developers. A poor understanding of the model reasoning process limits the way in which current neural models are leveraged today, so far mostly for their raw prediction. To fill this gap, this work studies how the processed attention signal of three open large language models - CodeGen, InCoder and GPT-J - agrees with how developers look at and explore code when each answers the same sensemaking questions about code. Furthermore, we contribute an open-source eye-tracking dataset comprising 92 manually-labeled sessions from 25 developers engaged in sensemaking tasks. We empirically evaluate five heuristics that do not use the attention and ten attention-based post-processing approaches of the attention signal of CodeGen against our ground truth of developers exploring code, including the novel concept of follow-up attention which exhibits the highest agreement between model and human attention. Our follow-up attention method can predict the next line a developer will look at with 47% accuracy. This outperforms the baseline prediction accuracy of 42.3%, which uses the session history of other developers to recommend the next line. These results demonstrate the potential of leveraging the attention signal of pre-trained models for effective code exploration. △ Less

Submitted 29 August, 2024; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: Published at IEEE Transactions on Software Engineering

arXiv:2206.06260 [pdf, other]

OpenCBS: An Open-Source COBOL Defects Benchmark Suite

Authors: Dylan Lee, Austin Henley, Bill Hinshaw, Rahul Pandita

Abstract: As the current COBOL workforce retires, entry-level developers are left to keep complex legacy systems maintained and operational. This creates a massive gap in knowledge and ability as companies are having their veteran developers replaced with a new, inexperienced workforce. Additionally, the lack of COBOL and mainframe technology in the current academic curriculum further increases the learning… ▽ More As the current COBOL workforce retires, entry-level developers are left to keep complex legacy systems maintained and operational. This creates a massive gap in knowledge and ability as companies are having their veteran developers replaced with a new, inexperienced workforce. Additionally, the lack of COBOL and mainframe technology in the current academic curriculum further increases the learning curve for this new generation of developers. These issues are becoming even more pressing due to the business-critical nature of these systems, which makes migrating or replacing the mainframe and COBOL anytime soon very unlikely. As a result, there is now a huge need for tools and resources to increase new developers' code comprehension and ability to perform routine tasks such as debugging and defect location. Extensive work has been done in the software engineering field on the creation of such resources. However, the proprietary nature of COBOL and mainframe systems has restricted the amount of work and the number of open-source tools available for this domain. To address this issue, our work leverages the publicly available technical forum data to build an open-source collection of COBOL programs embodying issues/defects faced by COBOL developers. These programs were reconstructed and organized in a benchmark suite to facilitate the testing of developer tools. Our goal is to provide an open-source COBOL benchmark and testing suite that encourage community contribution and serve as a resource for researchers and tool-smiths in this domain. △ Less

Submitted 13 June, 2022; originally announced June 2022.

arXiv:2204.08108 [pdf, other]

How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools

Authors: Adam Tutko, Austin Z. Henley, Audris Mockus

Abstract: With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is mined from software projects, however, requires extensive processing and needs to be handled with utmost care to ensure valid conclusions. Since the software d… ▽ More With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is mined from software projects, however, requires extensive processing and needs to be handled with utmost care to ensure valid conclusions. Since the software development practices and tools have changed over two decades, we aim to understand the state-of-the-art research workflows and to highlight potential challenges. We employ a systematic literature review by sampling over one thousand papers from leading conferences and by analyzing the 286 most relevant papers from the perspective of data workflows, methodologies, reproducibility, and tools. We found that an important part of the research workflow involving dataset selection was particularly problematic, which raises questions about the generality of the results in existing literature. Furthermore, we found a considerable number of papers provide little or no reproducibility instructions -- a substantial deficiency for a data-intensive field. In fact, 33% of papers provide no information on how their data was retrieved. Based on these findings, we propose ways to address these shortcomings via existing tools and also provide recommendations to improve research workflows and the reproducibility of research. △ Less

Submitted 17 April, 2022; originally announced April 2022.

Comments: 11 Pages

MSC Class: 68N99

arXiv:2102.06098 [pdf, other]

An Inquisitive Code Editor for Addressing Novice Programmers' Misconceptions of Program Behavior

Authors: Austin Z. Henley, Julian Ball, Benjamin Klein, Aiden Rutter, Dylan Lee

Abstract: Novice programmers face numerous barriers while attempting to learn how to code that may deter them from pursuing a computer science degree or career in software development. In this work, we propose a tool concept to address the particularly challenging barrier of novice programmers holding misconceptions about how their code behaves. Specifically, the concept involves an inquisitive code editor… ▽ More Novice programmers face numerous barriers while attempting to learn how to code that may deter them from pursuing a computer science degree or career in software development. In this work, we propose a tool concept to address the particularly challenging barrier of novice programmers holding misconceptions about how their code behaves. Specifically, the concept involves an inquisitive code editor that: (1) identifies misconceptions by periodically prompting the novice programmer with questions about their program's behavior, (2) corrects the misconceptions by generating explanations based on the program's actual behavior, and (3) prevents further misconceptions by inserting test code and utilizing other educational resources. We have implemented portions of the concept as plugins for the Atom code editor and conducted informal surveys with students and instructors. Next steps include deploying the tool prototype to students enrolled in introductory programming courses. △ Less

Submitted 11 February, 2021; originally announced February 2021.

Comments: Accepted to ICSE-JSEET'21

arXiv:2011.06244 [pdf, other]

A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Authors: Steffen Herbold, Alexander Trautsch, Benjamin Ledel, Alireza Aghamohammadi, Taher Ahmed Ghaleb, Kuljit Kaur Chahal, Tim Bossenmaier, Bhaveet Nagaria, Philip Makedonski, Matin Nili Ahmadabadi, Kristof Szabados, Helge Spieker, Matej Madeja, Nathaniel Hoy, Valentina Lenarduzzi, Shangwen Wang, Gema Rodríguez-Pérez, Ricardo Colomo-Palacios, Roberto Verdecchia, Paramvir Singh, Yihao Qin, Debasish Chakroborti, Willard Davis, Vijay Walunj, Hongjun Wu , et al. (23 additional authors not shown)

Abstract: Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Metho… ▽ More Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise. △ Less

Submitted 13 October, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: Status: Accepted at Empirical Software Engineering

arXiv:2008.03439 [pdf, ps, other]

More Effective Software Repository Mining

Authors: Adam Tutko, Austin Henley, Audris Mockus

Abstract: Background: Data mining and analyzing of public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the results obtained on such ``convenience samples'' generalize. Aims: This paper aims to elucidate the difficulties faced by researchers who would like to ascertain… ▽ More Background: Data mining and analyzing of public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the results obtained on such ``convenience samples'' generalize. Aims: This paper aims to elucidate the difficulties faced by researchers who would like to ascertain the generalizability of their findings by introducing an interface that addresses the issues with obtaining representative samples. Results: To do that we explore how to exploit the World of Code system to make software repository sampling and analysis much more accessible. Specifically, we present a resource for Mining Software Repository researchers that is intended to simplify data sampling and retrieval workflow and, through that, increase the validity and completeness of data. Conclusions: This system has the potential to provide researchers a resource that greatly eases the difficulty of data retrieval and addresses many of the currently standing issues with data sampling. △ Less

Submitted 16 August, 2020; v1 submitted 8 August, 2020; originally announced August 2020.

Comments: 5 pages, 3 figures, Submitted to ESEM2020 Emerging Results track

Showing 1–14 of 14 results for author: Henley, A