Search | arXiv e-print repository

The Impact of Sanctions on GitHub Developers and Activities

Authors: Youmei Fan, Ani Hovhannisyan, Hideaki Hata, Christoph Treude, Raula Gaikovina Kula

Abstract: The GitHub platform has fueled the creation of truly global software, enabling contributions from developers across various geographical regions of the world. As software becomes more entwined with global politics and social regulations, it becomes similarly subject to government sanctions. In 2019, GitHub restricted access to certain services for users in specific locations but rolled back these… ▽ More The GitHub platform has fueled the creation of truly global software, enabling contributions from developers across various geographical regions of the world. As software becomes more entwined with global politics and social regulations, it becomes similarly subject to government sanctions. In 2019, GitHub restricted access to certain services for users in specific locations but rolled back these restrictions for some communities (e.g., the Iranian community) in 2021. We conducted a large-scale empirical study, collecting approximately 156 thousand user profiles and their 41 million activity points from 2008 to 2022, to understand the response of developers. Our results indicate that many of these targeted developers were able to navigate through the sanctions. Furthermore, once these sanctions were lifted, these developers opted to return to GitHub instead of withdrawing their contributions to the platform. The study indicates that platforms like GitHub play key roles in sustaining global contributions to Open Source Software. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2402.08967 [pdf, other]

doi 10.1145/3643773

Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions

Authors: Tao Xiao, Hideaki Hata, Christoph Treude, Kenichi Matsumoto

Abstract: GitHub's Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs, such as generating summaries of changes or providing complete walkthroughs with links to the relevant code. As this innovative technology gains traction in the Open Source Software (OSS) community, it is crucial to examine its early adoption and its impact on the development p… ▽ More GitHub's Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs, such as generating summaries of changes or providing complete walkthroughs with links to the relevant code. As this innovative technology gains traction in the Open Source Software (OSS) community, it is crucial to examine its early adoption and its impact on the development process. Additionally, it offers a unique opportunity to observe how developers respond when they disagree with the generated content. In our study, we employ a mixed-methods approach, blending quantitative analysis with qualitative insights, to examine 18,256 PRs in which parts of the descriptions were crafted by generative AI. Our findings indicate that: (1) Copilot for PRs, though in its infancy, is seeing a marked uptick in adoption. (2) PRs enhanced by Copilot for PRs require less review time and have a higher likelihood of being merged. (3) Developers using Copilot for PRs often complement the automated descriptions with their manual input. These results offer valuable insights into the growing integration of generative AI in software development. △ Less

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.08920 [pdf, other]

doi 10.1007/s10664-024-10449-5

Quantifying and Characterizing Clones of Self-Admitted Technical Debt in Build Systems

Authors: Tao Xiao, Zhili Zeng, Dong Wang, Hideaki Hata, Shane McIntosh, Kenichi Matsumoto

Abstract: Self-Admitted Technical Debt (SATD) annotates development decisions that intentionally exchange long-term software artifact quality for short-term goals. Recent work explores the existence of SATD clones (duplicate or near duplicate SATD comments) in source code. Cloning of SATD in build systems (e.g., CMake and Maven) may propagate suboptimal design choices, threatening qualities of the build sys… ▽ More Self-Admitted Technical Debt (SATD) annotates development decisions that intentionally exchange long-term software artifact quality for short-term goals. Recent work explores the existence of SATD clones (duplicate or near duplicate SATD comments) in source code. Cloning of SATD in build systems (e.g., CMake and Maven) may propagate suboptimal design choices, threatening qualities of the build system that stakeholders rely upon (e.g., maintainability, reliability, repeatability). Hence, we conduct a large-scale study on 50,608 SATD comments extracted from Autotools, CMake, Maven, and Ant build systems to investigate the prevalence of SATD clones and to characterize their incidences. We observe that: (i) prior work suggests that 41-65% of SATD comments in source code are clones, but in our studied build system context, the rates range from 62% to 95%, suggesting that SATD clones are a more prevalent phenomenon in build systems than in source code; (ii) statements surrounding SATD clones are highly similar, with 76% of occurrences having similarity scores greater than 0.8; (iii) a quarter of SATD clones are introduced by the author of the original SATD statements; and (iv) among the most commonly cloned SATD comments, external factors (e.g., platform and tool configuration) are the most frequent locations, limitations in tools and libraries are the most frequent causes, and developers often copy SATD comments that describe issues to be fixed later. Our work presents the first step toward systematically understanding SATD clones in build systems and opens up avenues for future work, such as distinguishing different SATD clone behavior, as well as designing an automated recommendation system for repaying SATD effectively based on resolved clones. △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2401.16715 [pdf, ps, other]

Going Viral: Case Studies on the Impact of Protestware

Authors: Youmei Fan, Dong Wang, Supatsara Wattanakriengkrai, Hathaichanok Damrongsiri, Christoph Treude, Hideaki Hata, Raula Gaikovina Kula

Abstract: Maintainers are now self-sabotaging their work in order to take political or economic stances, a practice referred to as "protestware". In this poster, we present our approach to understand how the discourse about such an attack went viral, how it is received by the community, and whether developers respond to the attack in a timely manner. We study two notable protestware cases, i.e., Colors.js a… ▽ More Maintainers are now self-sabotaging their work in order to take political or economic stances, a practice referred to as "protestware". In this poster, we present our approach to understand how the discourse about such an attack went viral, how it is received by the community, and whether developers respond to the attack in a timely manner. We study two notable protestware cases, i.e., Colors.js and es5-ext, comparing with discussions of a typical security vulnerability as a baseline, i.e., Ua-parser, and perform a thematic analysis of more than two thousand protest-related posts to extract the different narratives when discussing protestware. △ Less

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.02755 [pdf, other]

"My GitHub Sponsors profile is live!" Investigating the Impact of Twitter/X Mentions on GitHub Sponsors

Authors: Youmei Fan, Tao Xiao, Hideaki Hata, Christoph Treude, Kenichi Matsumoto

Abstract: GitHub Sponsors was launched in 2019, enabling donations to open-source software developers to provide financial support, as per GitHub's slogan: "Invest in the projects you depend on". However, a 2022 study on GitHub Sponsors found that only two-fifths of developers who were seeking sponsorship received a donation. The study found that, other than internal actions (such as offering perks to spons… ▽ More GitHub Sponsors was launched in 2019, enabling donations to open-source software developers to provide financial support, as per GitHub's slogan: "Invest in the projects you depend on". However, a 2022 study on GitHub Sponsors found that only two-fifths of developers who were seeking sponsorship received a donation. The study found that, other than internal actions (such as offering perks to sponsors), developers had advertised their GitHub Sponsors profiles on social media, such as Twitter (also known as X). Therefore, in this work, we investigate the impact of tweets that contain links to GitHub Sponsors profiles on sponsorship, as well as their reception on Twitter/X. We further characterize these tweets to understand their context and find that (1) such tweets have the impact of increasing the number of sponsors acquired, (2) compared to other donation platforms such as Open Collective and Patreon, GitHub Sponsors has significantly fewer interactions but is more visible on Twitter/X, and (3) developers tend to contribute more to open-source software during the week of posting such tweets. Our findings are the first step toward investigating the impact of social media on obtaining funding to sustain open-source software. △ Less

Submitted 5 January, 2024; originally announced January 2024.

arXiv:2309.15017 [pdf, other]

Studying the association between Gitcoin's issues and resolving outcomes

Authors: Morakot Choetkiertikul, Arada Puengmongkolchaikit, Pandaree Chandra, Chaiyong Ragkitwetsakul, Rungroj Maipradit, Hideaki Hata, Thanwadee Sunetnanta, Kenichi Matsumoto

Abstract: The development of open-source software (OSS) projects usually have been driven through collaborations among contributors and strongly relies on volunteering. Thus, allocating software practitioners (e.g., contributors) to a particular task is non-trivial and draws attention away from the development. Therefore, a number of bug bounty platforms have emerged to address this problem through bounty r… ▽ More The development of open-source software (OSS) projects usually have been driven through collaborations among contributors and strongly relies on volunteering. Thus, allocating software practitioners (e.g., contributors) to a particular task is non-trivial and draws attention away from the development. Therefore, a number of bug bounty platforms have emerged to address this problem through bounty rewards. Especially, Gitcoin, a new bounty platform, introduces a bounty reward mechanism that allows individual issue owners (backers) to define a reward value using cryptocurrencies rather than using crowdfunding mechanisms. Although a number of studies have investigated the phenomenon on bounty platforms, those rely on different bounty reward systems. Our study thus investigates the association between the Gitcoin bounties and their outcomes (i.e., success and non-success). We empirically study over 4,000 issues with Gitcoin bounties using statistical analysis and machine learning techniques. We also conducted a comparative study with the Bountysource platform to gain insights into the usage of both platforms. Our study highlights the importance of factors such as the length of the project, issue description, type of bounty issue, and the bounty value, which are found to be highly correlated with the outcome of bounty issues. These findings can provide useful guidance to practitioners. △ Less

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2309.03914 [pdf, other]

doi 10.1145/3643991.3648400

DevGPT: Studying Developer-ChatGPT Conversations

Authors: Tao Xiao, Christoph Treude, Hideaki Hata, Kenichi Matsumoto

Abstract: This paper introduces DevGPT, a dataset curated to explore how software developers interact with ChatGPT, a prominent large language model (LLM). The dataset encompasses 29,778 prompts and responses from ChatGPT, including 19,106 code snippets, and is linked to corresponding software development artifacts such as source code, commits, issues, pull requests, discussions, and Hacker News threads. Th… ▽ More This paper introduces DevGPT, a dataset curated to explore how software developers interact with ChatGPT, a prominent large language model (LLM). The dataset encompasses 29,778 prompts and responses from ChatGPT, including 19,106 code snippets, and is linked to corresponding software development artifacts such as source code, commits, issues, pull requests, discussions, and Hacker News threads. This comprehensive dataset is derived from shared ChatGPT conversations collected from GitHub and Hacker News, providing a rich resource for understanding the dynamics of developer interactions with ChatGPT, the nature of their inquiries, and the impact of these interactions on their work. DevGPT enables the study of developer queries, the effectiveness of ChatGPT in code generation and problem solving, and the broader implications of AI-assisted programming. By providing this dataset, the paper paves the way for novel research avenues in software engineering, particularly in understanding and improving the use of LLMs like ChatGPT by developers. △ Less

Submitted 13 February, 2024; v1 submitted 31 August, 2023; originally announced September 2023.

Comments: MSR 2024 Mining Challenge Proposal

arXiv:2305.16591 [pdf, other]

doi 10.1007/s10664-023-10325-8

18 Million Links in Commit Messages: Purpose, Evolution, and Decay

Authors: Tao Xiao, Sebastian Baltes, Hideaki Hata, Christoph Treude, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto

Abstract: Commit messages contain diverse and valuable types of knowledge in all aspects of software maintenance and evolution. Links are an example of such knowledge. Previous work on "9.6 million links in source code comments" showed that links are prone to decay, become outdated, and lack bidirectional traceability. We conducted a large-scale study of 18,201,165 links from commits in 23,110 GitHub reposi… ▽ More Commit messages contain diverse and valuable types of knowledge in all aspects of software maintenance and evolution. Links are an example of such knowledge. Previous work on "9.6 million links in source code comments" showed that links are prone to decay, become outdated, and lack bidirectional traceability. We conducted a large-scale study of 18,201,165 links from commits in 23,110 GitHub repositories to investigate whether they suffer the same fate. Results show that referencing external resources is prevalent and that the most frequent domains other than github.com are the external domains of Stack Overflow and Google Code. Similarly, links serve as source code context to commit messages, with inaccessible links being frequent. Although repeatedly referencing links is rare (4%), 14% of links that are prone to evolve become unavailable over time; e.g., tutorials or articles and software homepages become unavailable over time. Furthermore, we find that 70% of the distinct links suffer from decay; the domains that occur the most frequently are related to Subversion repositories. We summarize that links in commits share the same fate as links in code, opening up avenues for future work. △ Less

Submitted 25 May, 2023; originally announced May 2023.

Journal ref: Empir Software Eng 28, 91 (2023)

arXiv:2305.03251 [pdf, other]

Meta-Maintanance for Dockerfiles: Are We There Yet?

Authors: Takeru Tanaka, Hideaki Hata, Bodin Chinthanet, Raula Gaikovina Kula, Kenichi Matsumoto

Abstract: Docker allows for the packaging of applications and dependencies, and its instructions are described in Dockerfiles. Nowadays, version pinning is recommended to avoid unexpected changes in the latest version of a package. However, version pinning in Dockerfiles is not yet fully realized (only 17k of the 141k Dockerfiles we analyzed), because of the difficulties caused by version pinning. To mainta… ▽ More Docker allows for the packaging of applications and dependencies, and its instructions are described in Dockerfiles. Nowadays, version pinning is recommended to avoid unexpected changes in the latest version of a package. However, version pinning in Dockerfiles is not yet fully realized (only 17k of the 141k Dockerfiles we analyzed), because of the difficulties caused by version pinning. To maintain Dockerfiles with version-pinned packages, it is important to update package versions, not only for improved functionality, but also for software supply chain security, as packages are changed to address vulnerabilities and bug fixes. However, when updating multiple version-pinned packages, it is necessary to understand the dependencies between packages and ensure version compatibility, which is not easy. To address this issue, we explore the applicability of the meta-maintenance approach, which aims to distribute the successful updates in a part of a group that independently maintains a common artifact. We conduct an exploratory analysis of 7,914 repositories on GitHub that hold Dockerfiles, which retrieve packages on GitHub by URLs. There were 385 repository groups with the same multiple package combinations, and 208 groups had Dockerfiles with newer version combinations compared to others, which are considered meta-maintenance applicable. Our findings support the potential of meta-maintenance for updating multiple version-pinned packages and also reveal future challenges. △ Less

Submitted 4 May, 2023; originally announced May 2023.

Comments: 10 pages

arXiv:2303.15684 [pdf, other]

Understanding the Role of Images on Stack Overflow

Authors: Dong Wang, Tao Xiao, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, Yasutaka Kamei

Abstract: Images are increasingly being shared by software developers in diverse channels including question-and-answer forums like Stack Overflow. Although prior work has pointed out that these images are meaningful and provide complementary information compared to their associated text, how images are used to support questions is empirically unknown. To address this knowledge gap, in this paper we specifi… ▽ More Images are increasingly being shared by software developers in diverse channels including question-and-answer forums like Stack Overflow. Although prior work has pointed out that these images are meaningful and provide complementary information compared to their associated text, how images are used to support questions is empirically unknown. To address this knowledge gap, in this paper we specifically conduct an empirical study to investigate (I) the characteristics of images, (II) the extent to which images are used in different question types, and (III) the role of images on receiving answers. Our results first show that user interface is the most common image content and undesired output is the most frequent purpose for sharing images. Moreover, these images essentially facilitate the understanding of 68% of sampled questions. Second, we find that discrepancy questions are more relatively frequent compared to those without images, but there are no significant differences observed in description length in all types of questions. Third, the quantitative results statistically validate that questions with images are more likely to receive accepted answers, but do not speed up the time to receive answers. Our work demonstrates the crucial role that images play by approaching the topic from a new angle and lays the foundation for future opportunities to use images to assist in tasks like generating questions and identifying question-relatedness. △ Less

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2303.10131 [pdf, ps, other]

She Elicits Requirements and He Tests: Software Engineering Gender Bias in Large Language Models

Authors: Christoph Treude, Hideaki Hata

Abstract: Implicit gender bias in software development is a well-documented issue, such as the association of technical roles with men. To address this bias, it is important to understand it in more detail. This study uses data mining techniques to investigate the extent to which 56 tasks related to software development, such as assigning GitHub issues and testing, are affected by implicit gender bias embed… ▽ More Implicit gender bias in software development is a well-documented issue, such as the association of technical roles with men. To address this bias, it is important to understand it in more detail. This study uses data mining techniques to investigate the extent to which 56 tasks related to software development, such as assigning GitHub issues and testing, are affected by implicit gender bias embedded in large language models. We systematically translated each task from English into a genderless language and back, and investigated the pronouns associated with each task. Based on translating each task 100 times in different permutations, we identify a significant disparity in the gendered pronoun associations with different tasks. Specifically, requirements elicitation was associated with the pronoun "he" in only 6% of cases, while testing was associated with "he" in 100% of cases. Additionally, tasks related to helping others had a 91% association with "he" while the same association for tasks related to asking coworkers was only 52%. These findings reveal a clear pattern of gender bias related to software development tasks and have important implications for addressing this issue both in the training of large language models and in broader society. △ Less

Submitted 17 March, 2023; originally announced March 2023.

Comments: 6 pages, MSR 2023

arXiv:2204.12712 [pdf, other]

Release as a Contract: A Concept of Meta-Maintenance for the Entire FLOSS Ecosystem

Authors: Hideaki Hata

Abstract: We advocate for a paradigm shift in supporting free/libre and open source software (FLOSS) ecosystem maintenance, from focusing on individual projects to monitoring a whole organic system of the entire FLOSS ecosystem, which we call software meta-maintenance. We discuss challenges of building a global source code management system, a global issue management system, and FLOSS human capital index, b… ▽ More We advocate for a paradigm shift in supporting free/libre and open source software (FLOSS) ecosystem maintenance, from focusing on individual projects to monitoring a whole organic system of the entire FLOSS ecosystem, which we call software meta-maintenance. We discuss challenges of building a global source code management system, a global issue management system, and FLOSS human capital index, based on the blockchain technologies. △ Less

Submitted 27 April, 2022; originally announced April 2022.

Comments: 4 pages

arXiv:2204.06531 [pdf, other]

Software Supply Chain Map: How Reuse Networks Expand

Authors: Hideaki Hata, Takashi Ishio

Abstract: Clone-and-own is a typical code reuse approach because of its simplicity and efficiency. Cloned software components are maintained independently by a new owner. These clone-and-own operations can be occurred sequentially, that is, cloned components can be cloned again and owned by other new owners on the supply chain. In general, code reuse is not documented well, consequently, appropriate changes… ▽ More Clone-and-own is a typical code reuse approach because of its simplicity and efficiency. Cloned software components are maintained independently by a new owner. These clone-and-own operations can be occurred sequentially, that is, cloned components can be cloned again and owned by other new owners on the supply chain. In general, code reuse is not documented well, consequently, appropriate changes like security patches cannot be propagated to descendant software projects. However, the OpenChain Project defined identifying and tracking source code reuses as responsibilities of FLOSS software staffs. Hence supporting source code reuse awareness is in a real need. This paper studies software reuse relations in FLOSS ecosystem. Technically, clone-and-own reuses of source code can be identified by file-level clone set detection. Since change histories are associated with files, we can determine origins and destinations in reusing across multiple software by considering times. By building software supply chain maps, we find that clone-and-own is prevalent in FLOSS development, and set of files are reused widely and repeatedly. These observations open up future challenges of maintaining and tracking global software genealogies. △ Less

Submitted 13 April, 2022; originally announced April 2022.

Comments: 12 pages, 14 figures

arXiv:2202.05751 [pdf, other]

doi 10.1145/3510003.3510116

GitHub Sponsors: Exploring a New Way to Contribute to Open Source

Authors: Naomichi Shimada, Tao Xiao, Hideaki Hata, Christoph Treude, Kenichi Matsumoto

Abstract: GitHub Sponsors, launched in 2019, enables donations to individual open source software (OSS) developers. Financial support for OSS maintainers and developers is a major issue in terms of sustaining OSS projects, and the ability to donate to individuals is expected to support the sustainability of developers, projects, and community. In this work, we conducted a mixed-methods study of GitHub Spons… ▽ More GitHub Sponsors, launched in 2019, enables donations to individual open source software (OSS) developers. Financial support for OSS maintainers and developers is a major issue in terms of sustaining OSS projects, and the ability to donate to individuals is expected to support the sustainability of developers, projects, and community. In this work, we conducted a mixed-methods study of GitHub Sponsors, including quantitative and qualitative analyses, to understand the characteristics of developers who are likely to receive donations and what developers think about donations to individuals. We found that: (1) sponsored developers are more active than non-sponsored developers, (2) the possibility to receive donations is related to whether there is someone in their community who is donating, and (3) developers are sponsoring as a new way to contribute to OSS. Our findings are the first step towards data-informed guidance for using GitHub Sponsors, opening up avenues for future work on this new way of financially sustaining the OSS community. △ Less

Submitted 11 February, 2022; originally announced February 2022.

Comments: 12 pages, ICSE 2022

arXiv:2109.02878 [pdf, other]

FixMe: A GitHub Bot for Detecting and Monitoring On-Hold Self-Admitted Technical Debt

Authors: Saranphon Phaithoon, Supakarn Wongnil, Patiphol Pussawong, Morakot Choetkiertikul, Chaiyong Ragkhitwetsagul, Thanwadee Sunetnanta, Rungroj Maipradit, Hideaki Hata, Kenichi Matsumoto

Abstract: Self-Admitted Technical Debt (SATD) is a special form of technical debt in which developers intentionally record their hacks in the code by adding comments for attention. Here, we focus on issue-related "On-hold SATD", where developers suspend proper implementation due to issues reported inside or outside the project. When the referenced issues are resolved, the On-hold SATD also need to be addres… ▽ More Self-Admitted Technical Debt (SATD) is a special form of technical debt in which developers intentionally record their hacks in the code by adding comments for attention. Here, we focus on issue-related "On-hold SATD", where developers suspend proper implementation due to issues reported inside or outside the project. When the referenced issues are resolved, the On-hold SATD also need to be addressed, but since monitoring these issue reports takes a lot of time and effort, developers may not be aware of the resolved issues and leave the On-hold SATD in the code. In this paper, we propose FixMe, a GitHub bot that helps developers detecting and monitoring On-hold SATD in their repositories and notify them whenever the On-hold SATDs are ready to be fixed (i.e. the referenced issues are resolved). The bot can automatically detect On-hold SATD comments from source code using machine learning techniques and discover referenced issues. When the referenced issues are resolved, developers will be notified by FixMe bot. The evaluation conducted with 11 participants shows that our FixMe bot can support them in dealing with On-hold SATD. FixMe is available at https://www.fixmebot.app/ and FixMe's VDO is at https://youtu.be/YSz9kFxN_YQ. △ Less

Submitted 7 September, 2021; originally announced September 2021.

Comments: 5 pages, ASE 2021

arXiv:2104.05891 [pdf, ps, other]

Science-Software Linkage: The Challenges of Traceability between Scientific Knowledge and Software Artifacts

Authors: Hideaki Hata, Jin L. C. Guo, Raula Gaikovina Kula, Christoph Treude

Abstract: Although computer science papers are often accompanied by software artifacts, connecting research papers to their software artifacts and vice versa is not always trivial. First of all, there is a lack of well-accepted standards for how such links should be provided. Furthermore, the provided links, if any, often become outdated: they are affected by link rot when pre-prints are removed, when repos… ▽ More Although computer science papers are often accompanied by software artifacts, connecting research papers to their software artifacts and vice versa is not always trivial. First of all, there is a lack of well-accepted standards for how such links should be provided. Furthermore, the provided links, if any, often become outdated: they are affected by link rot when pre-prints are removed, when repositories are migrated, or when papers and repositories evolve independently. In this paper, we summarize the state of the practice of linking research papers and associated source code, highlighting the recent efforts towards creating and maintaining such links. We also report on the results of several empirical studies focusing on the relationship between scientific papers and associated software artifacts, and we outline challenges related to traceability and opportunities for overcoming these challenges. △ Less

Submitted 12 April, 2021; originally announced April 2021.

Comments: 5 pages

arXiv:2102.09775 [pdf]

doi 10.1109/TSE.2021.3115772

Characterizing and Mitigating Self-Admitted Technical Debt in Build Systems

Authors: Tao Xiao, Dong Wang, Shane McIntosh, Hideaki Hata, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto

Abstract: Technical Debt is a metaphor used to describe the situation in which long-term software artifact quality is traded for short-term goals in software projects. In recent years, the concept of self-admitted technical debt (SATD) was proposed, which focuses on debt that is intentionally introduced and described by developers. Although prior work has made important observations about admitted technical… ▽ More Technical Debt is a metaphor used to describe the situation in which long-term software artifact quality is traded for short-term goals in software projects. In recent years, the concept of self-admitted technical debt (SATD) was proposed, which focuses on debt that is intentionally introduced and described by developers. Although prior work has made important observations about admitted technical debt in source code, little is known about SATD in build systems. In this paper, we set out to better understand the characteristics of SATD in build systems. To do so, through a qualitative analysis of 500 SATD comments in the Maven build system of 291 projects, we characterize SATD by location and rationale (reason and purpose). Our results show that limitations in tools and libraries, and complexities of dependency management are the most frequent causes, accounting for 50% and 24% of the comments. We also find that developers often document SATD as issues to be fixed later. As a first step towards the automatic detection of SATD rationale, we train classifiers to detect the two most frequently occurring reasons and the four most frequently occurring purposes of SATD in the content of comments in Maven build systems. The classifier performance is promising, achieving an F1-score of 0.71-0.79. Finally, within 16 identified 'ready-to-be-addressed' SATD instances, the three SATD submitted by pull requests and the five SATD submitted by issue reports were resolved after developers were made aware. Our work presents the first step towards understanding technical debt in build systems and opens up avenues for future work, such as tool support to track and manage SATD backlogs. △ Less

Submitted 2 November, 2021; v1 submitted 19 February, 2021; originally announced February 2021.

arXiv:2102.06355 [pdf, other]

Same File, Different Changes: The Potential of Meta-Maintenance on GitHub

Authors: Hideaki Hata, Raula Gaikovina Kula, Takashi Ishio, Christoph Treude

Abstract: Online collaboration platforms such as GitHub have provided software developers with the ability to easily reuse and share code between repositories. With clone-and-own and forking becoming prevalent, maintaining these shared files is important, especially for keeping the most up-to-date version of reused code. Different to related work, we propose the concept of meta-maintenance -- i.e., tracking… ▽ More Online collaboration platforms such as GitHub have provided software developers with the ability to easily reuse and share code between repositories. With clone-and-own and forking becoming prevalent, maintaining these shared files is important, especially for keeping the most up-to-date version of reused code. Different to related work, we propose the concept of meta-maintenance -- i.e., tracking how the same files evolve in different repositories with the aim to provide useful maintenance opportunities to those files. We conduct an exploratory study by analyzing repositories from seven different programming languages to explore the potential of meta-maintenance. Our results indicate that a majority of active repositories on GitHub contains at least one file which is also present in another repository, and that a significant minority of these files are maintained differently in the different repositories which contain them. We manually analyzed a representative sample of shared files and their variants to understand which changes might be useful for meta-maintenance. Our findings support the potential of meta-maintenance and open up avenues for future work to capitalize on this potential. △ Less

Submitted 12 February, 2021; originally announced February 2021.

Comments: 12 pages, ICSE 2021

arXiv:2102.05230 [pdf, other]

GitHub Discussions: An Exploratory Study of Early Adoption

Authors: Hideaki Hata, Nicole Novielli, Sebastian Baltes, Raula Gaikovina Kula, Christoph Treude

Abstract: Discussions is a new feature of GitHub for asking questions or discussing topics outside of specific Issues or Pull Requests. Before being available to all projects in December 2020, it had been tested on selected open source software projects. To understand how developers use this novel feature, how they perceive it, and how it impacts the development processes, we conducted a mixed-methods study… ▽ More Discussions is a new feature of GitHub for asking questions or discussing topics outside of specific Issues or Pull Requests. Before being available to all projects in December 2020, it had been tested on selected open source software projects. To understand how developers use this novel feature, how they perceive it, and how it impacts the development processes, we conducted a mixed-methods study based on early adopters of GitHub discussions from January until July 2020. We found that: (1) errors, unexpected behavior, and code reviews are prevalent discussion categories; (2) there is a positive relationship between project member involvement and discussion frequency; (3) developers consider GitHub Discussions useful but face the problem of topic duplication between Discussions and Issues; (4) Discussions play a crucial role in advancing the development of projects; and (5) positive sentiment in Discussions is more frequent than in Stack Overflow posts. Our findings are a first step towards data-informed guidance for using GitHub Discussions, opening up avenues for future work on this novel communication channel. △ Less

Submitted 30 September, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

Comments: 37 pages, Empirical Software Engineering

arXiv:2102.01325 [pdf, ps, other]

FLOSS != GitHub: A Case Study of Linux/BSD Perceptions from Microsoft's Acquisition of GitHub

Authors: Raula Gaikovina Kula, Hideki Hata, Kenichi Matsumoto

Abstract: In 2018, the software industry giants Microsoft made a move into the Open Source world by completing the acquisition of mega Open Source platform, GitHub. This acquisition was not without controversy, as it is well-known that the free software communities includes not only the ability to use software freely, but also the libre nature in Open Source Software. In this study, our aim is to explore th… ▽ More In 2018, the software industry giants Microsoft made a move into the Open Source world by completing the acquisition of mega Open Source platform, GitHub. This acquisition was not without controversy, as it is well-known that the free software communities includes not only the ability to use software freely, but also the libre nature in Open Source Software. In this study, our aim is to explore these perceptions in FLOSS developers. We conducted a survey that covered traditional FLOSS source Linux, and BSD communities and received 246 developer responses. The results of the survey confirm that the free community did trigger some communities to move away from GitHub and raised discussions into free and open software on the GitHub platform. The study reminds us that although GitHub is influential and trendy, it does not representative all FLOSS communities. △ Less

Submitted 2 February, 2021; originally announced February 2021.

Comments: 5 pages

arXiv:2012.12466 [pdf, other]

doi 10.1007/s10515-022-00364-8

A Framework for Conditional Statement Technical Debt Identification and Description

Authors: Abdulaziz Alhefdhi, Hoa Khanh Dam, Yusuf Sulistyo Nugroho, Hideaki Hata, Takashi Ishio, Aditya Ghose

Abstract: Technical Debt occurs when development teams favour short-term operability over long-term stability. Since this places software maintainability at risk, technical debt requires early attention to avoid paying for accumulated interest. Most of the existing work focuses on detecting technical debt using code comments, known as Self-Admitted Technical Debt (SATD). However, there are many cases where… ▽ More Technical Debt occurs when development teams favour short-term operability over long-term stability. Since this places software maintainability at risk, technical debt requires early attention to avoid paying for accumulated interest. Most of the existing work focuses on detecting technical debt using code comments, known as Self-Admitted Technical Debt (SATD). However, there are many cases where technical debt instances are not explicitly acknowledged but deeply hidden in the code. In this paper, we propose a framework that caters for the absence of SATD comments in code. Our Self-Admitted Technical Debt Identification and Description (SATDID) framework determines if technical debt should be self-admitted for an input code fragment. If that is the case, SATDID will automatically generate the appropriate descriptive SATD comment that can be attached with the code. While our approach is applicable in principle to any type of code fragments, we focus in this study on technical debt hidden in conditional statements, one of the most TD-carrying parts of code. We explore and evaluate different implementations of SATDID. The evaluation results demonstrate the applicability and effectiveness of our framework over multiple benchmarks. Comparing with the results from the benchmarks, our approach provides at least 21.35%, 59.36%, 31.78%, and 583.33% improvements in terms of Precision, Recall, F-1, and Bleu-4 scores, respectively. In addition, we conduct human evaluation to the SATD comments generated by SATDID. In 1-5 and 0-5 scales for Acceptability and Understandability, the total means achieved by our approach are 3.128 and 3.172, respectively. △ Less

Submitted 13 October, 2022; v1 submitted 22 December, 2020; originally announced December 2020.

Journal ref: Autom Softw Eng 29, 60 (2022)

arXiv:2009.13113 [pdf, other]

Automated Identification of On-hold Self-admitted Technical Debt

Authors: Rungroj Maipradit, Bin Lin, Csaba Nagy, Gabriele Bavota, Michele Lanza, Hideaki Hata, Kenichi Matsumoto

Abstract: Modern software is developed under considerable time pressure, which implies that developers more often than not have to resort to compromises when it comes to code that is well written and code that just does the job. This has led over the past decades to the concept of "technical debt", a short-term hack that potentially generates long-term maintenance problems. Self-admitted technical debt (SAT… ▽ More Modern software is developed under considerable time pressure, which implies that developers more often than not have to resort to compromises when it comes to code that is well written and code that just does the job. This has led over the past decades to the concept of "technical debt", a short-term hack that potentially generates long-term maintenance problems. Self-admitted technical debt (SATD) is a particular form of technical debt: developers consciously perform the hack but also document it in the code by adding comments as a reminder (or as an admission of guilt). We focus on a specific type of SATD, namely "On-hold" SATD, in which developers document in their comments the need to halt an implementation task due to conditions outside of their scope of work (e.g., an open issue must be closed before a function can be implemented). We present an approach, based on regular expressions and machine learning, which is able to detect issues referenced in code comments, and to automatically classify the detected instances as either "On-hold" (the issue is referenced to indicate the need to wait for its resolution before completing a task), or as "cross-reference", (the issue is referenced to document the code, for example to explain the rationale behind an implementation choice). Our approach also mines the issue tracker of the projects to check if the On-hold SATD instances are "superfluous" and can be removed (i.e., the referenced issue has been closed, but the SATD is still in the code). Our evaluation confirms that our approach can indeed identify relevant instances of On-hold SATD. We illustrate its usefulness by identifying superfluous On-hold SATD instances in open source projects as confirmed by the original developers. △ Less

Submitted 28 September, 2020; originally announced September 2020.

Comments: 11 pages, 20th IEEE International Working Conference on Source Code Analysis and Manipulation

arXiv:2009.09130 [pdf, other]

How are Project-Specific Forums Utilized? A Study of Participation, Content, and Sentiment in the Eclipse Ecosystem

Authors: Yusuf Sulistyo Nugroho, Syful Islam, Keitaro Nakasai, Ifraz Rehman, Hideaki Hata, Raula Gaikovina Kula, Meiyappan Nagappan, Kenichi Matsumoto

Abstract: Although many software development projects have moved their developer discussion forums to generic platforms such as Stack Overflow, Eclipse has been steadfast in hosting their self-supported community forums. While recent studies show forums share similarities to generic communication channels, it is unknown how project-specific forums are utilized. In this paper, we analyze 832,058 forum thread… ▽ More Although many software development projects have moved their developer discussion forums to generic platforms such as Stack Overflow, Eclipse has been steadfast in hosting their self-supported community forums. While recent studies show forums share similarities to generic communication channels, it is unknown how project-specific forums are utilized. In this paper, we analyze 832,058 forum threads and their linkages to four systems with 2,170 connected contributors to understand the participation, content and sentiment. Results show that Seniors are the most active participants to respond bug and non-bug-related threads in the forums (i.e., 66.1% and 45.5%), and sentiment among developers are inconsistent while knowledge sharing within Eclipse. We recommend the users to identify appropriate topics and ask in a positive procedural way when joining forums. For developers, preparing project-specific forums could be an option to bridge the communication between members. Irrespective of the popularity of Stack Overflow, we argue the benefits of using project-specific forum initiatives, such as GitHub Discussions, are needed to cultivate a community and its ecosystem. △ Less

Submitted 5 August, 2021; v1 submitted 18 September, 2020; originally announced September 2020.

Comments: 33 pages, 7 figures

arXiv:2009.03612 [pdf, other]

doi 10.1109/TSE.2020.3023177

Predicting Defective Lines Using a Model-Agnostic Technique

Authors: Supatsara Wattanakriengkrai, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Hideaki Hata, Kenichi Matsumoto

Abstract: Defect prediction models are proposed to help a team prioritize source code areas files that need Software QualityAssurance (SQA) based on the likelihood of having defects. However, developers may waste their unnecessary effort on the whole filewhile only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of lines of a file are defective. Hence, in thi… ▽ More Defect prediction models are proposed to help a team prioritize source code areas files that need Software QualityAssurance (SQA) based on the likelihood of having defects. However, developers may waste their unnecessary effort on the whole filewhile only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of lines of a file are defective. Hence, in this work, we propose a novel framework (called LINE-DP) to identify defective lines using a model-agnostic technique, i.e., an Explainable AI technique that provides information why the model makes such a prediction. Broadly speaking, our LINE-DP first builds a file-level defect model using code token features. Then, our LINE-DP uses a state-of-the-art model-agnostic technique (i.e.,LIME) to identify risky tokens, i.e., code tokens that lead the file-level defect model to predict that the file will be defective. Then, the lines that contain risky tokens are predicted as defective lines. Through a case study of 32 releases of nine Java open source systems, our evaluation results show that our LINE-DP achieves an average recall of 0.61, a false alarm rate of 0.47, a top 20%LOC recall of0.27, and an initial false alarm of 16, which are statistically better than six baseline approaches. Our evaluation shows that our LINE-DP requires an average computation time of 10 seconds including model construction and defective line identification time. In addition, we find that 63% of defective lines that can be identified by our LINE-DP are related to common defects (e.g., argument change, condition change). These results suggest that our LINE-DP can effectively identify defective lines that contain common defectswhile requiring a smaller amount of inspection effort and a manageable computation cost. △ Less

Submitted 8 September, 2020; originally announced September 2020.

arXiv:2006.14185 [pdf, ps, other]

Optimizing Affine Maximizer Auctions via Linear Programming: an Application to Revenue Maximizing Mechanism Design for Zero-Day Exploits Markets

Authors: Mingyu Guo, Hideaki Hata, Ali Babar

Abstract: Optimizing within the affine maximizer auctions (AMA) is an effective approach for revenue maximizing mechanism design. The AMA mechanisms are strategy-proof and individually rational (if the agents' valuations for the outcomes are nonnegative). Every AMA mechanism is characterized by a list of parameters. By focusing on the AMA mechanisms, we turn mechanism design into a value optimization proble… ▽ More Optimizing within the affine maximizer auctions (AMA) is an effective approach for revenue maximizing mechanism design. The AMA mechanisms are strategy-proof and individually rational (if the agents' valuations for the outcomes are nonnegative). Every AMA mechanism is characterized by a list of parameters. By focusing on the AMA mechanisms, we turn mechanism design into a value optimization problem, where we only need to adjust the parameters. We propose a linear programming based heuristic for optimizing within the AMA family. We apply our technique to revenue maximizing mechanism design for zero-day exploit markets. We show that due to the nature of the zero-day exploit markets, if there are only two agents (one offender and one defender), then our technique generally produces a near optimal mechanism: the mechanism's expected revenue is close to the optimal revenue achieved by the optimal strategy-proof and individually rational mechanism (not necessarily an AMA mechanism). △ Less

Submitted 25 June, 2020; originally announced June 2020.

Journal ref: PRIMA 2017: Principles and Practice of Multi-Agent Systems

arXiv:2006.14184 [pdf, ps, other]

Revenue Maximizing Markets for Zero-Day Exploits

Authors: Mingyu Guo, Hideaki Hata, Ali Babar

Abstract: Markets for zero-day exploits (software vulnerabilities unknown to the vendor) have a long history and a growing popularity. We study these markets from a revenue-maximizing mechanism design perspective. We first propose a theoretical model for zero-day exploits markets. In our model, one exploit is being sold to multiple buyers. There are two kinds of buyers, which we call the defenders and the o… ▽ More Markets for zero-day exploits (software vulnerabilities unknown to the vendor) have a long history and a growing popularity. We study these markets from a revenue-maximizing mechanism design perspective. We first propose a theoretical model for zero-day exploits markets. In our model, one exploit is being sold to multiple buyers. There are two kinds of buyers, which we call the defenders and the offenders. The defenders are buyers who buy vulnerabilities in order to fix them (e.g., software vendors). The offenders, on the other hand, are buyers who intend to utilize the exploits (e.g., national security agencies and police). Our model is more than a single-item auction. First, an exploit is a piece of information, so one exploit can be sold to multiple buyers. Second, buyers have externalities. If one defender wins, then the exploit becomes worthless to the offenders. Third, if we disclose the details of the exploit to the buyers before the auction, then they may leave with the information without paying. On the other hand, if we do not disclose the details, then it is difficult for the buyers to come up with their private valuations. Considering the above, our proposed mechanism discloses the details of the exploit to all offenders before the auction. The offenders then pay to delay the exploit being disclosed to the defenders. △ Less

Submitted 25 June, 2020; originally announced June 2020.

Journal ref: PRIMA 2016: Principles and Practice of Multi-Agent Systems

arXiv:2005.01127 [pdf, other]

doi 10.1007/s10664-020-09875-y

Pandemic Programming: How COVID-19 affects software developers and how their organizations can help

Authors: Paul Ralph, Sebastian Baltes, Gianisa Adisaputri, Richard Torkar, Vladimir Kovalenko, Marcos Kalinowski, Nicole Novielli, Shin Yoo, Xavier Devroey, Xin Tan, Minghui Zhou, Burak Turhan, Rashina Hoda, Hideaki Hata, Gregorio Robles, Amin Milani Fard, Rana Alkadhi

Abstract: Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into… ▽ More Context. As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions. Objective. This study investigates the effects of the pandemic on developers' wellbeing and productivity. Method. A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling. Results. The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers' wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support. Conclusions. To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees' home offices. Women, parents and disabled persons may require extra support. △ Less

Submitted 20 July, 2020; v1 submitted 3 May, 2020; originally announced May 2020.

Comments: 34 pages, 7 tables, 5 figures, to appear in Empirical Software Engineering

Journal ref: Empirical Software Engineering, 2020

arXiv:2004.00199 [pdf, other]

GitHub Repositories with Links to Academic Papers: Public Access, Traceability, and Evolution

Authors: Supatsara Wattanakriengkrai, Bodin Chinthanet, Hideaki Hata, Raula Gaikovina Kula, Christoph Treude, Jin Guo, Kenichi Matsumoto

Abstract: Traceability between published scientific breakthroughs and their implementation is essential, especially in the case of open-source scientific software which implements bleeding-edge science in its code. However, aligning the link between GitHub repositories and academic papers can prove difficult, and the current practice of establishing and maintaining such links remains unknown. This paper inv… ▽ More Traceability between published scientific breakthroughs and their implementation is essential, especially in the case of open-source scientific software which implements bleeding-edge science in its code. However, aligning the link between GitHub repositories and academic papers can prove difficult, and the current practice of establishing and maintaining such links remains unknown. This paper investigates the role of academic paper references contained in these repositories. We conduct a large-scale study of 20 thousand GitHub repositories that make references to academic papers. We use a mixed-methods approach to identify public access, traceability and evolutionary aspects of the links. Although referencing a paper is not typical, we find that a vast majority of referenced academic papers are public access. These repositories tend to be affiliated with academic communities. More than half of the papers do not link back to any repository. We find that academic papers from top-tier SE venues are not likely to reference a repository, but when they do, they usually link to a GitHub software repository. In a network of arXiv papers and referenced repositories, we find that the most referenced papers are (i) highly-cited in academia and (ii) are referenced by repositories written in different programming languages. △ Less

Submitted 3 October, 2021; v1 submitted 31 March, 2020; originally announced April 2020.

Comments: 28 pages

arXiv:2001.09630 [pdf, other]

doi 10.1007/s10664-020-09807-w

Ammonia: An Approach for Deriving Project-specific Bug Patterns

Authors: Yoshiki Higo, Shinpei Hayashi, Hideaki Hata, Meiyappan Nagappan

Abstract: Finding and fixing buggy code is an important and cost-intensive maintenance task, and static analysis (SA) is one of the methods developers use to perform it. SA tools warn developers about potential bugs by scanning their source code for commonly occurring bug patterns, thus giving those developers opportunities to fix the warnings (potential bugs) before they release the software. Typically, SA… ▽ More Finding and fixing buggy code is an important and cost-intensive maintenance task, and static analysis (SA) is one of the methods developers use to perform it. SA tools warn developers about potential bugs by scanning their source code for commonly occurring bug patterns, thus giving those developers opportunities to fix the warnings (potential bugs) before they release the software. Typically, SA tools scan for general bug patterns that are common to any software project (such as null pointer dereference), and not for project specific patterns. However, past research has pointed to this lack of customizability as a severe limiting issue in SA. Accordingly, in this paper, we propose an approach called Ammonia, which is based on statically analyzing changes across the development history of a project, as a means to identify project-specific bug patterns. Furthermore, the bug patterns identified by our tool do not relate to just one developer or one specific commit, they reflect the project as a whole and compliment the warnings from other SA tools that identify general bug patterns. Herein, we report on the application of our implemented tool and approach to four Java projects: Ant, Camel, POI, and Wicket. The results obtained show that our tool could detect 19 project specific bug patterns across those four projects. Next, through manual analysis, we determined that six of those change patterns were actual bugs and submitted pull requests based on those bug patterns. As a result, five of the pull requests were merged. △ Less

Submitted 14 March, 2020; v1 submitted 27 January, 2020; originally announced January 2020.

Comments: 28 pages, Empirical Software Engineering

Journal ref: Empirical Software Engineering, 25(3):1951-1979, 2020

arXiv:1911.04016 [pdf, other]

doi 10.1109/MS.2021.3098116

Challenges for Inclusion in Software Engineering: The Case of the Emerging Papua New Guinean Society

Authors: Raula Gaikovina Kula, Christoph Treude, Hideaki Hata, Sebastian Baltes, Igor Steinmacher, Marco Aurelio Gerosa, Winifred Kula Amini

Abstract: Software plays a central role in modern societies, with its high economic value and potential for advancing societal change. In this paper, we characterise challenges and opportunities for a country progressing towards entering the global software industry, focusing on Papua New Guinea (PNG). By hosting a Software Engineering workshop, we conducted a qualitative study by recording talks (n=3), emp… ▽ More Software plays a central role in modern societies, with its high economic value and potential for advancing societal change. In this paper, we characterise challenges and opportunities for a country progressing towards entering the global software industry, focusing on Papua New Guinea (PNG). By hosting a Software Engineering workshop, we conducted a qualitative study by recording talks (n=3), employing a questionnaire (n=52), and administering an in-depth focus group session with local actors (n=5). Based on a thematic analysis, we identified challenges as barriers and opportunities for the PNG software engineering community. We also discuss the state of practices and how to make it inclusive for practitioners, researchers, and educators from both the local and global software engineering community. △ Less

Submitted 22 July, 2021; v1 submitted 31 October, 2019; originally announced November 2019.

Comments: IEEE Software

Journal ref: IEEE Software (2021)

arXiv:1910.06932 [pdf, other]

From Academia to Software Development: Publication Citations in Source Code Comments

Authors: Akira Inokuchi, Yusuf Sulistyo Nugroho, Supatsara Wattanakriengkrai, Fumiaki Konishi, Hideaki Hata, Christoph Treude, Akito Monden, Kenichi Matsumoto

Abstract: Academic publications have been evaluated in terms of their impact on research communities based on many metrics, such as the number of citations. On the other hand, the impact of academic publications on industry has been rarely studied. This paper investigates how academic publications contribute to software development by analyzing publication citations in source code comments in open source so… ▽ More Academic publications have been evaluated in terms of their impact on research communities based on many metrics, such as the number of citations. On the other hand, the impact of academic publications on industry has been rarely studied. This paper investigates how academic publications contribute to software development by analyzing publication citations in source code comments in open source software repositories. We propose an automated approach for detecting academic publications based on Named Entity Recognition, and achieve 0.90 in $F_1$ as detection accuracy. We conduct a large-scale study of publication citations with 319,438,977 comments collected from 25,925 active repositories written in seven programming languages. Our findings indicate that academic publications can be knowledge sources for software development. These referenced publications are particularly from journals. In terms of knowledge transfer, algorithm is the most prevalent type of knowledge transferred from the publications, with proposed formulas or equations typically implemented in methods or functions in source code files. In a closer look at GitHub repositories referencing academic publications, we find that science-related repositories are the most frequent among GitHub repositories with publication citations, and that the vast majority of these publications are referenced by repository owners who are different from the publication authors. We also find that referencing older publications can lead to potential issues related to obsolete knowledge. △ Less

Submitted 1 May, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

Comments: 33 pages

arXiv:1907.06182 [pdf, other]

doi 10.1109/APSIPAASC47483.2019.9023036

Towards Generation of Visual Attention Map for Source Code

Authors: Takeshi D. Itoh, Takatomi Kubo, Kiyoka Ikeda, Yuki Maruno, Yoshiharu Ikutani, Hideaki Hata, Kenichi Matsumoto, Kazushi Ikeda

Abstract: Program comprehension is a dominant process in software development and maintenance. Experts are considered to comprehend the source code efficiently by directing their gaze, or attention, to important components in it. However, reflecting the importance of components is still a remaining issue in gaze behavior analysis for source code comprehension. Here we show a conceptual framework to compare… ▽ More Program comprehension is a dominant process in software development and maintenance. Experts are considered to comprehend the source code efficiently by directing their gaze, or attention, to important components in it. However, reflecting the importance of components is still a remaining issue in gaze behavior analysis for source code comprehension. Here we show a conceptual framework to compare the quantified importance of source code components with the gaze behavior of programmers. We use "attention" in attention models (e.g., code2vec) as the importance indices for source code components and evaluate programmers' gaze locations based on the quantified importance. In this report, we introduce the idea of our gaze behavior analysis using the attention map, and the results of a preliminary experiment. △ Less

Submitted 13 August, 2019; v1 submitted 14 July, 2019; originally announced July 2019.

Comments: 4 pages, 2 figures; APSIPA 2019 ACCEPTED

Journal ref: APSIPA ASC (2019)

arXiv:1907.04557 [pdf, other]

Identifying Algorithm Names in Code Comments

Authors: Jakapong Klainongsuang, Yusuf Sulistyo Nugroho, Hideaki Hata, Bundit Manaskasemsak, Arnon Rungsawang, Pattara Leelaprute, Kenichi Matsumoto

Abstract: For recent machine-learning-based tasks like API sequence generation, comment generation, and document generation, large amount of data is needed. When software developers implement algorithms in code, we find that they often mention algorithm names in code comments. Code annotated with such algorithm names can be valuable data sources. In this paper, we propose an automatic method of algorithm na… ▽ More For recent machine-learning-based tasks like API sequence generation, comment generation, and document generation, large amount of data is needed. When software developers implement algorithms in code, we find that they often mention algorithm names in code comments. Code annotated with such algorithm names can be valuable data sources. In this paper, we propose an automatic method of algorithm name identification. The key idea is extracting important N-gram words containing the word `algorithm' in the last. We also consider part of speech patterns to derive rules for appropriate algorithm name identification. The result of our rule evaluation produced high precision and recall values (more than 0.70). We apply our rules to extract algorithm names in a large amount of comments from active FLOSS projects written in seven programming languages, C, C++, Java, JavaScript, Python, PHP, and Ruby, and report commonly mentioned algorithm names in code comments. △ Less

Submitted 10 July, 2019; originally announced July 2019.

Comments: 10 pages

arXiv:1905.03593 [pdf, other]

doi 10.1016/j.jss.2019.110416

A Topological Analysis of Communication Channels for Knowledge Sharing in Contemporary GitHub Projects

Authors: Jirateep Tantisuwankul, Yusuf Sulistyo Nugroho, Raula Gaikovina Kula, Hideaki Hata, Arnon Rungsawang, Pattara Leelaprute, Kenichi Matsumoto

Abstract: With over 28 million developers, success of the GitHub collaborative platform is highlighted through an abundance of communication channels among contemporary software projects. Knowledge is broken into two forms and its sharing (through communication channels) can be described as externalization or combination by the SECI model. Such platforms have revolutionized the way developers work, introduc… ▽ More With over 28 million developers, success of the GitHub collaborative platform is highlighted through an abundance of communication channels among contemporary software projects. Knowledge is broken into two forms and its sharing (through communication channels) can be described as externalization or combination by the SECI model. Such platforms have revolutionized the way developers work, introducing new channels to share knowledge in the form of pull requests, issues and wikis. It is unclear how these channels capture and share knowledge. In this research, our goal is to analyze these communication channels in GitHub. First, using the SECI model, we are able to map how knowledge is shared through the communication channels. Then in a large-scale topology analysis of seven library package projects (i.e., involving over 70 thousand projects), we extracted insights of the different communication channels within GitHub. Using two research questions, we explored the evolution of the channels and adoption of channels by both popular and unpopular library package projects. Results show that (i) contemporary GitHub Projects tend to adopt multiple communication channels, (ii) communication channels change over time and (iii) communication channels are used to both capture new knowledge (i.e., externalization) and updating existing knowledge (i.e., combination). △ Less

Submitted 8 September, 2019; v1 submitted 9 May, 2019; originally announced May 2019.

Comments: 30 pages

arXiv:1904.12162 [pdf, other]

Sentiment Classification using N-gram IDF and Automated Machine Learning

Authors: Rungroj Maipradit, Hideaki Hata, Kenichi Matsumoto

Abstract: We propose a sentiment classification method with a general machine learning framework. For feature representation, n-gram IDF is used to extract software-engineering-related, dataset-specific, positive, neutral, and negative n-gram expressions. For classifiers, an automated machine learning tool is used. In the comparison using publicly available datasets, our method achieved the highest F1 value… ▽ More We propose a sentiment classification method with a general machine learning framework. For feature representation, n-gram IDF is used to extract software-engineering-related, dataset-specific, positive, neutral, and negative n-gram expressions. For classifiers, an automated machine learning tool is used. In the comparison using publicly available datasets, our method achieved the highest F1 values in positive and negative sentences on all datasets. △ Less

Submitted 25 May, 2019; v1 submitted 27 April, 2019; originally announced April 2019.

Comments: 4 pages, IEEE Software

arXiv:1903.06320 [pdf, other]

Toward Imitating Visual Attention of Experts in Software Development Tasks

Authors: Yoshiharu Ikutani, Nishanth Koganti, Hideaki Hata, Takatomi Kubo, Kenichi Matsumoto

Abstract: Expert programmers' eye-movements during source code reading are valuable sources that are considered to be associated with their domain expertise. We advocate a vision of new intelligent systems incorporating expertise of experts for software development tasks, such as issue localization, comment generation, and code generation. We present a conceptual framework of neural autonomous agents based… ▽ More Expert programmers' eye-movements during source code reading are valuable sources that are considered to be associated with their domain expertise. We advocate a vision of new intelligent systems incorporating expertise of experts for software development tasks, such as issue localization, comment generation, and code generation. We present a conceptual framework of neural autonomous agents based on imitation learning (IL), which enables agents to mimic the visual attention of an expert via his/her eye movement. In this framework, an autonomous agent is constructed as a context-based attention model that consists of encoder/decoder network and trained with state-action sequences generated by an experts' demonstration. Challenges to implement an IL-based autonomous agent specialized for software development task are discussed in this paper. △ Less

Submitted 14 March, 2019; originally announced March 2019.

Comments: 4 pages, EMIP 2019

arXiv:1902.02467 [pdf, other]

doi 10.1007/s10664-019-09772-z

How Different Are Different diff Algorithms in Git?

Authors: Yusuf Sulistyo Nugroho, Hideaki Hata, Kenichi Matsumoto

Abstract: Automatic identification of the differences between two versions of a file is a common and basic task in several applications of mining code repositories. Git, a version control system, has a diff utility and users can select algorithms of diff from the default algorithm Myers to the advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in rece… ▽ More Automatic identification of the differences between two versions of a file is a common and basic task in several applications of mining code repositories. Git, a version control system, has a diff utility and users can select algorithms of diff from the default algorithm Myers to the advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in recent studies. On the impact on code churn metrics in 14 Java projects, we obtained different values in 1.7% to 8.2% commits based on the different diff algorithms. Regarding bug-introducing change identification, we found 6.0% and 13.3% in the identified bug-fix commits had different results of bug-introducing changes from 10 Java projects. For patch application, we found that the Histogram is more suitable than Myers for providing the changes of code, from our manual analysis. Thus, we strongly recommend using the Histogram algorithm when mining Git repositories to consider differences in source code. △ Less

Submitted 16 October, 2019; v1 submitted 6 February, 2019; originally announced February 2019.

Comments: 38 pages, Empirical Software Engineering

arXiv:1901.09511 [pdf, other]

Wait For It: Identifying "On-Hold" Self-Admitted Technical Debt

Authors: Rungroj Maipradit, Christoph Treude, Hideaki Hata, Kenichi Matsumoto

Abstract: Self-admitted technical debt refers to situations where a software developer knows that their current implementation is not optimal and indicates this using a source code comment. In this work, we hypothesize that it is possible to develop automated techniques to understand a subset of these comments in more detail, and to propose tool support that can help developers manage self-admitted technica… ▽ More Self-admitted technical debt refers to situations where a software developer knows that their current implementation is not optimal and indicates this using a source code comment. In this work, we hypothesize that it is possible to develop automated techniques to understand a subset of these comments in more detail, and to propose tool support that can help developers manage self-admitted technical debt more effectively. Based on a qualitative study of 335 comments indicating self-admitted technical debt, we first identify one particular class of debt amenable to automated management: "on-hold" self-admitted technical debt, i.e., debt which contains a condition to indicate that a developer is waiting for a certain event or an updated functionality having been implemented elsewhere. We then design and evaluate an automated classifier which can identify these "on-hold" instances with an area under the receiver operating characteristic curve (AUC) of 0.83 as well as detect the specific conditions that developers are waiting for. Our work presents a first step towards automated tool support that is able to indicate when certain instances of self-admitted technical debt are ready to be addressed. △ Less

Submitted 21 October, 2019; v1 submitted 27 January, 2019; originally announced January 2019.

Comments: 33 pages

arXiv:1901.07440 [pdf, other]

9.6 Million Links in Source Code Comments: Purpose, Evolution, and Decay

Authors: Hideaki Hata, Christoph Treude, Raula Gaikovina Kula, Takashi Ishio

Abstract: Links are an essential feature of the World Wide Web, and source code repositories are no exception. However, despite their many undisputed benefits, links can suffer from decay, insufficient versioning, and lack of bidirectional traceability. In this paper, we investigate the role of links contained in source code comments from these perspectives. We conducted a large-scale study of around 9.6 mi… ▽ More Links are an essential feature of the World Wide Web, and source code repositories are no exception. However, despite their many undisputed benefits, links can suffer from decay, insufficient versioning, and lack of bidirectional traceability. In this paper, we investigate the role of links contained in source code comments from these perspectives. We conducted a large-scale study of around 9.6 million links to establish their prevalence, and we used a mixed-methods approach to identify the links' targets, purposes, decay, and evolutionary aspects. We found that links are prevalent in source code repositories, that licenses, software homepages, and specifications are common types of link targets, and that links are often included to provide metadata or attribution. Links are rarely updated, but many link targets evolve. Almost 10% of the links included in source code comments are dead. We then submitted a batch of link-fixing pull requests to open source software repositories, resulting in most of our fixes being merged successfully. Our findings indicate that links in source code comments can indeed be fragile, and our work opens up avenues for future work to address these problems. △ Less

Submitted 15 February, 2019; v1 submitted 22 January, 2019; originally announced January 2019.

Comments: 12 pages, ICSE 2019

arXiv:1812.07170 [pdf, other]

Learning to Generate Corrective Patches using Neural Machine Translation

Authors: Hideaki Hata, Emad Shihab, Graham Neubig

Abstract: Bug fixing is generally a manually-intensive task. However, recent work has proposed the idea of automated program repair, which aims to repair (at least a subset of) bugs in different ways such as code mutation, etc. Following in the same line of work as automated bug repair, in this paper we aim to leverage past fixes to propose fixes of current/future bugs. Specifically, we propose Ratchet, a c… ▽ More Bug fixing is generally a manually-intensive task. However, recent work has proposed the idea of automated program repair, which aims to repair (at least a subset of) bugs in different ways such as code mutation, etc. Following in the same line of work as automated bug repair, in this paper we aim to leverage past fixes to propose fixes of current/future bugs. Specifically, we propose Ratchet, a corrective patch generation system using neural machine translation. By learning corresponding pre-correction and post-correction code in past fixes with a neural sequence-to-sequence model, Ratchet is able to generate a fix code for a given bug-prone code query. We perform an empirical study with five open source projects, namely Ambari, Camel, Hadoop, Jetty and Wicket, to evaluate the effectiveness of Ratchet. Our findings show that Ratchet can generate syntactically valid statements 98.7% of the time, and achieve an F1-measure between 0.29 - 0.83 with respect to the actual fixes adopted in the code base. In addition, we perform a qualitative validation using 20 participants to see whether the generated statements can be helpful in correcting bugs. Our survey showed that Ratchet's output was considered to be helpful in fixing the bugs on many occasions, even if fix was not 100% correct. △ Less

Submitted 3 July, 2019; v1 submitted 17 December, 2018; originally announced December 2018.

Comments: 20 pages

arXiv:1805.03844 [pdf, other]

Human Capital in Software Engineering: A Systematic Mapping of Reconceptualized Human Aspect Studies

Authors: Saya Onoue, Hideaki Hata, Raula Gaikovina Kula, Kenichi Matsumoto

Abstract: The human capital invested into software development plays a vital role in the success of any software project. By human capital, we do not mean the individuals themselves, but involves the range of knowledge and skills (i.e., human aspects) invested to create value during development. However, there is still no consensus on how these broad terms of human aspects relate to the health of a project.… ▽ More The human capital invested into software development plays a vital role in the success of any software project. By human capital, we do not mean the individuals themselves, but involves the range of knowledge and skills (i.e., human aspects) invested to create value during development. However, there is still no consensus on how these broad terms of human aspects relate to the health of a project. In this study, we reconceptualize human aspects of software engineering (SE) into a framework (i.e., SE human capital). The study presents a systematic mapping to survey and classify existing human aspect studies into four dimensions of the framework: capacity, deployment, development, and know-how (based on the Global Human Capital Index). From premium SE publishing venues (five journal articles and four conferences), we extract 2,698 hits of papers published between 2013 to 2017. Using a search criteria, we then narrow our results to 340 papers. Finally, we use inclusion and exclusion criteria to manually select 78 papers (49 quantitative and 29 qualitative studies). Using research questions, we uncover related topics, theories and data origins. The key outcome of this paper is a set of indicators for SE human capital. This work is towards the creation of a SE Human Capital Index (SE-HCI) to capture and rank human aspects, with the potential to assess progress within projects, and point to opportunities for cross-project learning and exchange across software projects. △ Less

Submitted 10 May, 2018; originally announced May 2018.

Comments: 35 pages

arXiv:1803.04129 [pdf, other]

Are Donation Badges Appealing? A Case Study of Developer Responses to Eclipse Bug Reports

Authors: Keitaro Nakasai, Hideaki Hata, Kenichi Matsumoto

Abstract: Eclipse, an open source software project, acknowledges its donors by presenting donation badges in its issue tracking system Bugzilla. However, the rewarding effect of this strategy is currently unknown. We applied a framework of causal inference to investigate relative promptness of developer response to bug reports with donation badges compared with bug reports without the badges, and estimated… ▽ More Eclipse, an open source software project, acknowledges its donors by presenting donation badges in its issue tracking system Bugzilla. However, the rewarding effect of this strategy is currently unknown. We applied a framework of causal inference to investigate relative promptness of developer response to bug reports with donation badges compared with bug reports without the badges, and estimated that donation badges decreases developer response time by a median time of about two hours. The appearance of donation badges is appealing for both donors and organizers because of its practical, rewarding and yet inexpensive effect. △ Less

Submitted 18 July, 2018; v1 submitted 12 March, 2018; originally announced March 2018.

Comments: 4 pages, IEEE Software

arXiv:1710.00446 [pdf, other]

Extracting Insights from the Topology of the JavaScript Package Ecosystem

Authors: Nuttapon Lertwittayatrai, Raula Gaikovina Kula, Saya Onoue, Hideaki Hata, Arnon Rungsawang, Pattara Leelaprute, Kenichi Matsumoto

Abstract: Software ecosystems have had a tremendous impact on computing and society, capturing the attention of businesses, researchers, and policy makers alike. Massive ecosystems like the JavaScript node package manager (npm) is evidence of how packages are readily available for use by software projects. Due to its high-dimension and complex properties, software ecosystem analysis has been limited. In thi… ▽ More Software ecosystems have had a tremendous impact on computing and society, capturing the attention of businesses, researchers, and policy makers alike. Massive ecosystems like the JavaScript node package manager (npm) is evidence of how packages are readily available for use by software projects. Due to its high-dimension and complex properties, software ecosystem analysis has been limited. In this paper, we leverage topological methods in visualize the high-dimensional datasets from a software ecosystem. Topological Data Analysis (TDA) is an emerging technique to analyze high-dimensional datasets, which enables us to study the shape of data. We generate the npm software ecosystem topology to uncover insights and extract patterns of existing libraries by studying its localities. Our real-world example reveals many interesting insights and patterns that describes the shape of a software ecosystem. △ Less

Submitted 1 October, 2017; originally announced October 2017.

Comments: 10 pages, APSEC 2017

arXiv:1709.10324 [pdf, other]

The Health and Wealth of OSS Projects: Evidence from Community Activities and Product Evolution

Authors: Saya Onoue, Raula Gaikovina Kula, Hideaki Hata, Kenichi Matsumoto

Abstract: Background: Understanding the condition of OSS projects is important to analyze features and predict the future of projects. In the field of demography and economics, health and wealth are considered to understand the condition of a country. Aim: In this paper, we apply this framework to OSS projects to understand the communities and the evolution of OSS projects from the perspectives of health an… ▽ More Background: Understanding the condition of OSS projects is important to analyze features and predict the future of projects. In the field of demography and economics, health and wealth are considered to understand the condition of a country. Aim: In this paper, we apply this framework to OSS projects to understand the communities and the evolution of OSS projects from the perspectives of health and wealth. Method: We define two measures of Workforce (WF) and Gross Product Pull Requests (GPPR). We analyze OSS projects in GitHub and investigate three typical cases. Results: We find that wealthy projects attract and rely on the casual workforce. Less wealthy projects may require additional efforts from their more experienced contributors. Conclusions: This paper presents an approach to assess the relationship between health and wealth of OSS projects. An interactive demo of our analysis is available at goo.gl/Ig6NTR. △ Less

Submitted 29 September, 2017; originally announced September 2017.

Comments: Submitted 2017

arXiv:1709.06224 [pdf, other]

Understanding the Heterogeneity of Contributors in Bug Bounty Programs

Authors: Hideaki Hata, Mingyu Guo, M. Ali Babar

Abstract: Background: While bug bounty programs are not new in software development, an increasing number of companies, as well as open source projects, rely on external parties to perform the security assessment of their software for reward. However, there is relatively little empirical knowledge about the characteristics of bug bounty program contributors. Aim: This paper aims to understand those contribu… ▽ More Background: While bug bounty programs are not new in software development, an increasing number of companies, as well as open source projects, rely on external parties to perform the security assessment of their software for reward. However, there is relatively little empirical knowledge about the characteristics of bug bounty program contributors. Aim: This paper aims to understand those contributors by highlighting the heterogeneity among them. Method: We analyzed the histories of 82 bug bounty programs and 2,504 distinct bug bounty contributors, and conducted a quantitative and qualitative survey. Results: We found that there are project-specific and non-specific contributors who have different motivations for contributing to the products and organizations. Conclusions: Our findings provide insights to make bug bounty programs better and for further studies of new software development roles. △ Less

Submitted 18 September, 2017; originally announced September 2017.

Comments: 6 pages, ESEM 2017

arXiv:1709.05768 [pdf, other]

Using High-Rising Cities to Visualize Performance in Real-Time

Authors: Katsuya Ogami, Raula Gaikovina Kula, Hideaki Hata, Takashi Ishio, Kenichi Matsumoto

Abstract: For developers concerned with a performance drop or improvement in their software, a profiler allows a developer to quickly search and identify bottlenecks and leaks that consume much execution time. Non real-time profilers analyze the history of already executed stack traces, while a real-time profiler outputs the results concurrently with the execution of software, so users can know the results… ▽ More For developers concerned with a performance drop or improvement in their software, a profiler allows a developer to quickly search and identify bottlenecks and leaks that consume much execution time. Non real-time profilers analyze the history of already executed stack traces, while a real-time profiler outputs the results concurrently with the execution of software, so users can know the results instantaneously. However, a real-time profiler risks providing overly large and complex outputs, which is difficult for developers to quickly analyze. In this paper, we visualize the performance data from a real-time profiler. We visualize program execution as a three-dimensional (3D) city, representing the structure of the program as artifacts in a city (i.e., classes and packages expressed as buildings and districts) and their program executions expressed as the fluctuating height of artifacts. Through two case studies and using a prototype of our proposed visualization, we demonstrate how our visualization can easily identify performance issues such as a memory leak and compare performance changes between versions of a program. A demonstration of the interactive features of our prototype is available at https://youtu.be/eleVo19Hp4k. △ Less

Submitted 17 September, 2017; originally announced September 2017.

Comments: 10 pages, VISSOFT 2017, Artifact: https://github.com/sefield/high-rising-city-artifact

arXiv:1709.05763 [pdf, other]

Bug or Not? Bug Report Classification Using N-Gram IDF

Authors: Pannavat Terdchanakul, Hideaki Hata, Passakorn Phannachitta, Kenichi Matsumoto

Abstract: Previous studies have found that a significant number of bug reports are misclassified between bugs and non-bugs, and that manually classifying bug reports is a time-consuming task. To address this problem, we propose a bug reports classification model with N-gram IDF, a theoretical extension of Inverse Document Frequency (IDF) for handling words and phrases of any length. N-gram IDF enables us to… ▽ More Previous studies have found that a significant number of bug reports are misclassified between bugs and non-bugs, and that manually classifying bug reports is a time-consuming task. To address this problem, we propose a bug reports classification model with N-gram IDF, a theoretical extension of Inverse Document Frequency (IDF) for handling words and phrases of any length. N-gram IDF enables us to extract key terms of any length from texts, these key terms can be used as the features to classify bug reports. We build classification models with logistic regression and random forest using features from N-gram IDF and topic modeling, which is widely used in various software engineering tasks. With a publicly available dataset, our results show that our N-gram IDF-based models have a superior performance than the topic-based models on all of the evaluated cases. Our models show promising results and have a potential to be extended to other software engineering tasks. △ Less

Submitted 17 September, 2017; originally announced September 2017.

Comments: 5 pages, ICSME 2017

Showing 1–47 of 47 results for author: Hata, H