-
MASAI: Modular Architecture for Software-engineering AI Agents
Authors:
Daman Arora,
Atharv Sonwane,
Nalin Wadhwa,
Abhav Mehrotra,
Saiteja Utpala,
Ramakrishna Bairi,
Aditya Kanade,
Nagarajan Natarajan
Abstract:
A common method to solve complex problems in software engineering is to divide the problem into multiple sub-problems. Inspired by this, we propose a Modular Architecture for Software-engineering AI (MASAI) agents, where different LLM-powered sub-agents are instantiated with well-defined objectives and strategies tuned to achieve those objectives. Our modular architecture offers several advantages: (1) employing and tuning different problem-solving strategies across sub-agents, (2) enabling sub-agents to gather information from different sources scattered throughout a repository, and (3) avoiding unnecessarily long trajectories that inflate costs and add extraneous context. MASAI enabled us to achieve the highest performance (28.33% resolution rate) on the popular and highly challenging SWE-bench Lite dataset consisting of 300 GitHub issues from 11 Python repositories. We conduct a comprehensive evaluation of MASAI relative to other agentic methods and analyze the effects of our design decisions and their contribution to the success of MASAI.
Submitted 17 June, 2024;
originally announced June 2024.
-
Task Facet Learning: A Structured Approach to Prompt Optimization
Authors:
Gurusha Juneja,
Nagarajan Natarajan,
Hua Li,
Jian Jiao,
Amit Sharma
Abstract:
Given a task in the form of a basic description and its training examples, prompt optimization is the problem of synthesizing the given information into a text prompt for a large language model (LLM). Humans solve this problem by also considering the different facets that define a task (e.g., counter-examples, explanations, analogies) and including them in the prompt. However, it is unclear whether existing algorithmic approaches, based on iteratively editing a given prompt or automatically selecting a few in-context examples, can cover the multiple facets required to solve a complex task. In this work, we view prompt optimization as that of learning multiple facets of a task from a set of training examples. We identify and exploit structure in the prompt optimization problem: first, we find that prompts can be broken down into loosely coupled semantic sections that have a relatively independent effect on the prompt's performance; second, we cluster the input space and use clustered batches so that the optimization procedure can learn the different facets of a task across batches. The resulting algorithm, UniPrompt, consists of a generative model to generate initial candidates for each prompt section and a feedback mechanism that aggregates suggested edits from multiple mini-batches into a conceptual description for the section. Empirical evaluation on multiple datasets and a real-world task shows that prompts generated using UniPrompt obtain higher accuracy than human-tuned prompts and those from state-of-the-art methods. In particular, our algorithm can generate long, complex prompts that existing methods are unable to generate. Code for UniPrompt will be available at https://aka.ms/uniprompt.
Submitted 15 June, 2024;
originally announced June 2024.
-
Provably Robust DPO: Aligning Language Models with Noisy Feedback
Authors:
Sayak Ray Chowdhury,
Anush Kini,
Nagarajan Natarajan
Abstract:
Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive.
In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function that de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2ε}\sqrt{\frac{d}{n}})$, where $ε < 1/2$ is the flip rate of labels, $d$ is the policy parameter dimension, and $n$ is the size of the dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
Submitted 11 April, 2024; v1 submitted 1 March, 2024;
originally announced March 2024.
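The de-biasing idea described in this abstract can be illustrated with a small numerical sketch. It assumes the standard unbiased-estimator construction for labels flipped with a known rate ε; the function names and the toy margin are hypothetical, and the paper's exact loss may differ in detail.

```python
import math


def logistic_loss(margin: float) -> float:
    """Negative log-likelihood -log(sigmoid(margin)) of a preference under the BTL model."""
    return math.log1p(math.exp(-margin))


def debiased_loss(margin: float, eps: float) -> float:
    """Loss on an observed (possibly flipped) pair, corrected for a known flip rate eps < 1/2.

    `margin` is the model's score difference in favour of the *observed* winner.
    This is an assumed textbook construction, not necessarily the paper's loss.
    """
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must lie in [0, 1/2)")
    return ((1 - eps) * logistic_loss(margin) - eps * logistic_loss(-margin)) / (1 - 2 * eps)


def expected_debiased_loss(true_margin: float, eps: float) -> float:
    """Average of the corrected loss over the random flip of a single clean pair."""
    kept = debiased_loss(true_margin, eps)      # label observed as-is, probability 1 - eps
    flipped = debiased_loss(-true_margin, eps)  # label flipped, probability eps
    return (1 - eps) * kept + eps * flipped
```

Averaged over the random flip, the corrected loss equals the clean logistic loss, which is the sense in which the noise is removed "on average". The 1/(1-2ε) factor in the correction matches the factor that appears in the stated sub-optimality bound.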
-
NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness
Authors:
Manav Singhal,
Tushar Aggarwal,
Abhijeet Awasthi,
Nagarajan Natarajan,
Aditya Kanade
Abstract:
Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrate robust understanding of requirements and code semantics.
We propose a new benchmark, NoFunEval, to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate the domain knowledge to the LMs. We conduct an extensive evaluation of twenty-two code LMs. Our finding is that they generally falter when tested on our benchmark, hinting at fundamental blind spots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling into question the depth of their comprehension and the source of their success in generating functionally-correct code in the first place. We will release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.
Submitted 2 February, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
GAR-meets-RAG Paradigm for Zero-Shot Information Retrieval
Authors:
Daman Arora,
Anush Kini,
Sayak Ray Chowdhury,
Nagarajan Natarajan,
Gaurav Sinha,
Amit Sharma
Abstract:
Given a query and a document corpus, the information retrieval (IR) task is to output a ranked list of relevant documents. Combining large language models (LLMs) with embedding-based retrieval models, recent work shows promising results on the zero-shot retrieval problem, i.e., no access to labeled data from the target domain. Two such popular paradigms are generation-augmented retrieval or GAR (generate additional context for the query and then retrieve), and retrieval-augmented generation or RAG (retrieve relevant documents as context and then generate answers). The success of these paradigms hinges on (i) high-recall retrieval models, which are difficult to obtain in the zero-shot setting, and (ii) high-precision (re-)ranking models which typically need a good initialization. In this work, we propose a novel GAR-meets-RAG recurrence formulation that overcomes the challenges of existing paradigms. Our method iteratively improves retrieval (via GAR) and rewrite (via RAG) stages in the zero-shot setting. A key design principle is that the rewrite-retrieval stages improve the recall of the system and a final re-ranking stage improves the precision. We conduct extensive experiments on zero-shot passage retrieval benchmarks, BEIR and TREC-DL. Our method establishes a new state-of-the-art in the BEIR benchmark, outperforming previous best results in Recall@100 and nDCG@10 metrics on 6 out of 8 datasets, with up to 17% relative gains over the previous best.
Submitted 30 October, 2023;
originally announced October 2023.
-
Differentially Private Reward Estimation with Preference Feedback
Authors:
Sayak Ray Chowdhury,
Xingyu Zhou,
Nagarajan Natarajan
Abstract:
Learning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests. Instead of relying on numerical rewards, the generative models are trained using reinforcement learning with human feedback (RLHF). These approaches first solicit feedback from human labelers typically in the form of pairwise comparisons between two possible actions, then estimate a reward model using these comparisons, and finally employ a policy based on the estimated reward model. An adversarial attack in any step of the above pipeline might reveal private and sensitive information of human labelers. In this work, we adopt the notion of label differential privacy (DP) and focus on the problem of reward estimation from preference-based feedback while protecting the privacy of each individual labeler. Specifically, we consider the parametric Bradley-Terry-Luce (BTL) model for such pairwise comparison feedback involving a latent reward parameter $θ^* \in \mathbb{R}^d$. Within a standard minimax estimation framework, we provide tight upper and lower bounds on the error in estimating $θ^*$ under both local and central models of DP. We show, for a given privacy budget $ε$ and number of samples $n$, that the additional cost to ensure label-DP under the local model is $Θ\big(\frac{1}{ e^ε-1}\sqrt{\frac{d}{n}}\big)$, while it is $Θ\big(\frac{\text{poly}(d)}{εn} \big)$ under the weaker central model. We perform simulations on synthetic data that corroborate these theoretical results.
Submitted 30 October, 2023;
originally announced October 2023.
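For intuition about the local model of label DP mentioned in this abstract, the sketch below uses classical randomized response on a binary preference label. This is an assumed, textbook mechanism chosen for illustration; the abstract does not say which mechanism the paper uses, and all names here are hypothetical.

```python
import math
import random


def keep_probability(eps: float) -> float:
    """Probability of reporting the true label under randomized response with budget eps > 0."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.exp(eps) / (1.0 + math.exp(eps))


def randomize_label(y: int, eps: float, rng: random.Random) -> int:
    """Report the binary label y truthfully with probability e^eps / (1 + e^eps), else flip it."""
    return y if rng.random() < keep_probability(eps) else 1 - y


def debias_mean(noisy_mean: float, eps: float) -> float:
    """Invert the expected effect of the flips on an average of reported labels."""
    p = keep_probability(eps)
    return (noisy_mean - (1 - p)) / (2 * p - 1)
```

The correction divides by 2p - 1 = (e^ε - 1)/(e^ε + 1), so the estimate becomes noisier as the privacy budget shrinks. This is consistent in spirit with the 1/(e^ε - 1) dependence in the stated local-model cost, although the sketch does not derive that bound.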
-
Frustrated with Code Quality Issues? LLMs can Help!
Authors:
Nalin Wadhwa,
Jui Pradhan,
Atharv Sonwane,
Surya Prakash Sahu,
Nagarajan Natarajan,
Aditya Kanade,
Suresh Parthasarathy,
Sriram Rajamani
Abstract:
As software projects progress, quality of code assumes paramount importance as it affects reliability, maintainability and security of software. For this reason, static analysis tools are used in developer workflows to flag code quality issues. However, developers need to spend extra effort to revise their code to improve code quality based on the tool findings. In this work, we investigate the use of (instruction-following) large language models (LLMs) to assist developers in revising code to resolve code quality issues. We present a tool, CORE (short for COde REvisions), architected using a pair of LLMs organized as a duo composed of a proposer and a ranker. Providers of static analysis tools recommend ways to mitigate the tool warnings and developers follow them to revise their code. The proposer LLM of CORE takes the same set of recommendations and applies them to generate candidate code revisions. The candidates that pass the static quality checks are retained. However, the LLM may introduce subtle, unintended functionality changes that may go undetected by the static analysis. The ranker LLM evaluates the changes made by the proposer using a rubric that closely follows the acceptance criteria that a developer would enforce. CORE uses the scores assigned by the ranker LLM to rank the candidate revisions before presenting them to the developer. CORE could revise 59.2% of Python files (across 52 quality checks) so that they pass scrutiny by both a tool and a human reviewer. The ranker LLM is able to reduce false positives by 25.8% in these cases. CORE produced revisions that passed the static analysis tool in 76.8% of Java files (across 10 quality checks), comparable to the 78.3% of a specialized program repair tool, with significantly less engineering effort.
Submitted 22 September, 2023;
originally announced September 2023.
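The propose, filter, and rank flow described in this abstract can be sketched as a small function. The three callables stand in for the proposer LLM, the static analysis tool, and the ranker LLM; their names and signatures are hypothetical and are not taken from CORE itself.

```python
from typing import Callable, List


def rank_revisions(
    source: str,
    propose: Callable[[str], List[str]],          # stand-in for the proposer LLM
    passes_static_checks: Callable[[str], bool],  # stand-in for the static analysis tool
    score: Callable[[str, str], float],           # stand-in for the ranker LLM's rubric score
) -> List[str]:
    """Return candidate revisions that pass the static checks, best-scored first."""
    candidates = propose(source)
    # Keep only candidates that the static analysis no longer flags.
    retained = [c for c in candidates if passes_static_checks(c)]
    # Order the survivors by the ranker's score of each change against the original.
    return sorted(retained, key=lambda c: score(source, c), reverse=True)
```

Passing the components in as callables keeps the sketch independent of any particular LLM client or analysis tool.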
-
StaticFixer: From Static Analysis to Static Repair
Authors:
Naman Jain,
Shubham Gandhi,
Atharv Sonwane,
Aditya Kanade,
Nagarajan Natarajan,
Suresh Parthasarathy,
Sriram Rajamani,
Rahul Sharma
Abstract:
Static analysis tools are traditionally used to detect and flag programs that violate properties. We show that static analysis tools can also be used to perturb programs that satisfy a property to construct variants that violate the property. Using this insight we can construct paired data sets of unsafe-safe program pairs, and learn strategies to automatically repair property violations. We present a system called StaticFixer, which automatically repairs information flow vulnerabilities using this approach. Since information flow properties are non-local (both to check and repair), StaticFixer also introduces a novel domain specific language (DSL) and strategy learning algorithms for synthesizing non-local repairs. We use StaticFixer to synthesize strategies for repairing two types of information flow vulnerabilities, unvalidated dynamic calls and cross-site scripting, and show that StaticFixer successfully repairs several hundred vulnerabilities from open source JavaScript repositories, outperforming neural baselines built using CodeT5 and Codex. Our datasets can be downloaded from http://aka.ms/StaticFixer.
△ Less
Submitted 23 July, 2023;
originally announced July 2023.
-
Personalizing Content Moderation on Social Media: User Perspectives on Moderation Choices, Interface Design, and Labor
Authors:
Shagun Jhaver,
Alice Qian Zhang,
Quanze Chen,
Nikhila Natarajan,
Ruotong Wang,
Amy Zhang
Abstract:
Social media platforms moderate content for each user by incorporating the outputs of both platform-wide content moderation systems and, in some cases, user-configured personal moderation preferences. However, it is unclear (1) how end users perceive the choices and affordances of different kinds of personal content moderation tools, and (2) how the introduction of personalization impacts user perceptions of platforms' content moderation responsibilities. This paper investigates end users' perspectives on personal content moderation tools by conducting an interview study with a diverse sample of 24 active social media users. We probe interviewees' preferences using simulated personal moderation interfaces, including word filters, sliders for toxicity levels, and boolean toxicity toggles. We also examine the labor involved for users in choosing moderation settings and present users' attitudes about the roles and responsibilities of social media platforms and other stakeholders towards moderation. We discuss how our findings can inform design solutions to improve transparency and controllability in personal content moderation tools.
△ Less
Submitted 17 May, 2023;
originally announced May 2023.
-
Trust Explanations to Do What They Say
Authors:
Neil Natarajan,
Reuben Binns,
Jun Zhao,
Nigel Shadbolt
Abstract:
How much are we to trust a decision made by an AI algorithm? Trusting an algorithm without cause may lead to abuse, and mistrusting it may similarly lead to disuse. Trust in an AI is only desirable if it is warranted; thus, calibrating trust is critical to ensuring appropriate use. In the name of calibrating trust appropriately, AI developers should provide contracts specifying use cases in which an algorithm can and cannot be trusted. Automated explanation of AI outputs is often touted as a method by which trust can be built in the algorithm. However, automated explanations arise from algorithms themselves, so trust in these explanations is similarly only desirable if it is warranted. Developers of algorithms explaining AI outputs (xAI algorithms) should provide similar contracts, which should specify use cases in which an explanation can and cannot be trusted.
△ Less
Submitted 14 February, 2023;
originally announced March 2023.
-
Simulating Network Paths with Recurrent Buffering Units
Authors:
Divyam Anshumaan,
Sriram Balasubramanian,
Shubham Tiwari,
Nagarajan Natarajan,
Sundararajan Sellamanickam,
Venkata N. Padmanabhan
Abstract:
Simulating physical network paths (e.g., the Internet) is a cornerstone research problem in the emerging sub-field of AI-for-networking. We seek a model that generates end-to-end packet delay values in response to the time-varying load offered by a sender, which is typically a function of the previously output delays. The problem setting is unique, and renders the state-of-the-art text and time-series generative models inapplicable or ineffective. We formulate an ML problem at the intersection of dynamical systems, sequential decision making, and time-series modeling. We propose a novel grey-box approach to network simulation that embeds the semantics of a physical network path in a new RNN-style model called RBU, providing the interpretability of standard network simulator tools, the power of neural models, and the efficiency of SGD-based techniques for learning, and yielding promising results on synthetic and real-world network traces.
△ Less
Submitted 6 December, 2022; v1 submitted 23 February, 2022;
originally announced February 2022.
-
Jigsaw: Large Language Models meet Program Synthesis
Authors:
Naman Jain,
Skanda Vaidyanath,
Arun Iyer,
Nagarajan Natarajan,
Suresh Parthasarathy,
Sriram Rajamani,
Rahul Sharma
Abstract:
Large pre-trained language models such as GPT-3, Codex, and Google's language model are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and caution. On the optimistic side, such large language models have the potential to improve productivity by providing an automated AI pair programmer for every programmer in the world. On the cautionary side, since these large language models do not understand program semantics, they offer no guarantees about the quality of the suggested code. In this paper, we present an approach to augment these large language models with post-processing steps based on program analysis and synthesis techniques that understand the syntax and semantics of programs. Further, we show that such techniques can make use of user feedback and improve with usage. We present our experiences from building and evaluating such a tool, Jigsaw, targeted at synthesizing code that uses the Python Pandas API from multi-modal inputs. Our experience suggests that as these large language models evolve for synthesizing code from intent, Jigsaw has an important role to play in improving the accuracy of the systems.
△ Less
Submitted 6 December, 2021;
originally announced December 2021.
-
Learning Accurate Decision Trees with Bandit Feedback via Quantized Gradient Descent
Authors:
Ajaykrishna Karthikeyan,
Naman Jain,
Nagarajan Natarajan,
Prateek Jain
Abstract:
Decision trees provide a rich family of highly non-linear but efficient models, due to which they continue to be the go-to family of predictive models by practitioners across domains. But learning trees is challenging due to their discrete decision boundaries. The state-of-the-art (SOTA) techniques resort to (a) learning soft trees, thereby losing logarithmic inference time; or (b) using methods tailored to specific supervised learning settings, requiring access to labeled examples and a loss function. In this work, by leveraging techniques like overparameterization and straight-through estimators, we propose a unified method that enables accurate end-to-end gradient-based tree training and can be deployed in a variety of settings like offline supervised learning and online learning with bandit feedback. Using extensive validation on standard benchmarks, we demonstrate that our method provides the best of both worlds, i.e., it is competitive with, and in some cases more accurate than, methods designed specifically for the supervised settings; and in bandit settings, where most existing tree learning techniques are not applicable, our models are still accurate and significantly outperform the applicable SOTA methods.
△ Less
Submitted 30 September, 2022; v1 submitted 15 February, 2021;
originally announced February 2021.
-
Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization
Authors:
Aadirupa Saha,
Nagarajan Natarajan,
Praneeth Netrapalli,
Prateek Jain
Abstract:
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where cost/reward functions $f_t$ admit a "pseudo-1d" structure, i.e., $f_t(w) = \ell_t(g_t(w))$ where the output of $g_t$ is one-dimensional. At each round, the learner observes context $x_t$, plays prediction $g_t(w_t; x_t)$ (e.g., $g_t(\cdot)=\langle x_t, \cdot\rangle$) for some $w_t \in \mathbb{R}^d$ and observes loss $\ell_t(g_t(w_t))$ where $\ell_t$ is a convex Lipschitz-continuous function. The goal is to minimize the standard regret metric. This pseudo-1d bandit convex optimization problem (\SBCO) arises frequently in domains such as online decision-making or parameter-tuning in large systems. For this problem, we first show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ for the regret of any algorithm, where $T$ is the number of rounds. We propose a new algorithm \sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively, guaranteeing the optimal regret bound mentioned above, up to additional logarithmic factors. In contrast, applying state-of-the-art online convex optimization methods leads to $\tilde{O}\left(\min\left(d^{9.5}\sqrt{T},\sqrt{d}T^{3/4}\right)\right)$ regret, which is significantly suboptimal in $d$.
△ Less
Submitted 15 February, 2021;
originally announced February 2021.
-
Programming by Rewards
Authors:
Nagarajan Natarajan,
Ajaykrishna Karthikeyan,
Prateek Jain,
Ivan Radicek,
Sriram Rajamani,
Sumit Gulwani,
Johannes Gehrke
Abstract:
We formalize and study "programming by rewards" (PBR), a new approach for specifying and synthesizing subroutines for optimizing some quantitative metric such as performance, resource utilization, or correctness over a benchmark. A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box component (which we can only run), that assigns a reward for each execution. The goal of the synthesizer is to synthesize a "decision function" $f$ which transforms the features to a decision value for the black-box component so as to maximize the expected reward $E[r \circ f (x)]$ for executing decisions $f(x)$ for various values of $x$. We consider a space of decision functions in a DSL of loop-free if-then-else programs, which can branch on linear functions of the input features in a tree-structure and compute a linear function of the inputs in the leaves of the tree. We find that this DSL captures decision functions that are manually written in practice by programmers. Our technical contribution is the use of continuous-optimization techniques to perform synthesis of such decision functions as if-then-else programs. We also show that the framework is theoretically founded: in cases when the rewards satisfy nice properties, the synthesized code is optimal in a precise sense.
We have leveraged PBR to synthesize non-trivial decision functions related to search and ranking heuristics in the PROSE codebase (an industrial strength program synthesis framework) and achieve results competitive with manually written procedures tuned over multiple man-years. We present empirical evaluation against other baseline techniques over real-world case studies (including PROSE) as well as on simple synthetic benchmarks.
△ Less
Submitted 14 July, 2020;
originally announced July 2020.
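The decision-function space described in this abstract (loop-free if-then-else programs that branch on linear functions of the features and compute a linear function at the leaves) can be sketched as a small tree interpreter. The class and field names below are hypothetical, and the `>` comparison against a threshold is an assumed convention rather than the paper's DSL.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class Leaf:
    """Leaf of the program: returns a linear function of the input features."""
    weights: Sequence[float]
    bias: float


@dataclass(frozen=True)
class Branch:
    """Internal node: tests a linear function of the features against a threshold."""
    weights: Sequence[float]
    threshold: float
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Branch]


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def decide(node: Node, x: Sequence[float]) -> float:
    """Evaluate the if-then-else program on features x and return the decision value."""
    while isinstance(node, Branch):
        node = node.if_true if dot(node.weights, x) > node.threshold else node.if_false
    return dot(node.weights, x) + node.bias


# Toy program: "if x[0] > 2 then x[1] + 0.5 else 1.0".
program = Branch([1.0, 0.0], 2.0, Leaf([0.0, 1.0], 0.5), Leaf([0.0, 0.0], 1.0))
```

In this toy program, `decide(program, [3.0, 4.0])` takes the true branch and evaluates the leaf `x[1] + 0.5`, while `decide(program, [1.0, 4.0])` falls through to the constant leaf. A synthesizer in the PBR setting would search over the weights, thresholds, and biases of such a tree to maximize expected reward; that search is not shown here.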
-
On Scaling Data-Driven Loop Invariant Inference
Authors:
Sahil Bhatia,
Saswat Padhi,
Nagarajan Natarajan,
Rahul Sharma,
Prateek Jain
Abstract:
Automated synthesis of inductive invariants is an important problem in software verification. Once all the invariants have been specified, software verification reduces to checking of verification conditions. Although static analyses to infer invariants have been studied for over forty years, recent years have seen a flurry of data-driven invariant inference techniques which guess invariants from examples instead of analyzing program text. However, these techniques have been demonstrated to scale only to programs with a small number of variables. In this paper, we study these scalability issues and address them in our tool oasis that improves the scale of data-driven invariant inference and outperforms state-of-the-art systems on benchmarks from the invariant inference track of the Syntax Guided Synthesis competition.
△ Less
Submitted 16 July, 2020; v1 submitted 26 November, 2019;
originally announced November 2019.
-
Leveraging Distributional Semantics for Multi-Label Learning
Authors:
Rahul Wadbude,
Vivek Gupta,
Piyush Rai,
Nagarajan Natarajan,
Harish Karnick,
Prateek Jain
Abstract:
We present ExMLDS (Extreme Multi-Label Learning using Distributional Semantics), a novel and scalable label embedding framework for large-scale multi-label learning. Our approach draws inspiration from ideas rooted in distributional semantics, specifically the Skip Gram Negative Sampling (SGNS) approach, widely used to learn word embeddings for natural language processing tasks. Learning such embeddings can be reduced to a certain matrix factorization. Our approach is novel in that it highlights interesting connections between label embedding methods used for multi-label learning and paragraph/document embedding methods commonly used for learning representations of text data. The framework can also be easily extended to incorporate auxiliary information such as label-label correlations; this is crucial especially when there are a lot of missing labels in the training data. We demonstrate the effectiveness of our approach through an extensive set of experiments on a variety of benchmark datasets, and show that the proposed learning methods perform favorably compared to several baselines and state-of-the-art methods for large-scale multi-label learning. To facilitate end-to-end learning, we develop a joint learning algorithm that can learn the embeddings as well as a regression model that predicts these embeddings given input features, via efficient gradient-based methods.
Submitted 10 November, 2017; v1 submitted 18 September, 2017;
originally announced September 2017.
-
Regret Bounds for Non-decomposable Metrics with Missing Labels
Authors:
Prateek Jain,
Nagarajan Natarajan
Abstract:
We consider the problem of recommending relevant labels (items) for a given data point (user). In particular, we are interested in the practically important setting where the evaluation is with respect to non-decomposable (over labels) performance metrics like the $F_1$ measure, and the training data has missing labels. To this end, we propose a generic framework that, given a performance metric $Ψ$, devises a regularized objective function and a threshold such that all the values in the predicted score vector above and only above the threshold are selected to be positive. We show that the regret or generalization error in the given metric $Ψ$ is ultimately bounded by the estimation error of certain underlying parameters. In particular, we derive regret bounds under three popular settings: a) collaborative filtering, b) multilabel classification, and c) PU (positive-unlabeled) learning. For each of the above problems, we obtain a precise non-asymptotic regret bound which is small even when a large fraction of labels is missing. Our empirical results on synthetic and benchmark datasets demonstrate that by explicitly modeling missing labels and optimizing the desired performance metric, our algorithm indeed achieves significantly better performance (like $F_1$ score) when compared to methods that do not model missing label information carefully.
Submitted 7 June, 2016;
originally announced June 2016.
-
Learning from Binary Labels with Instance-Dependent Corruption
Authors:
Aditya Krishna Menon,
Brendan van Rooyen,
Nagarajan Natarajan
Abstract:
Suppose we have a sample of instances paired with binary labels corrupted by arbitrary instance- and label-dependent noise. With sufficiently many such samples, can we optimally classify and rank instances with respect to the noise-free distribution? We provide a theoretical analysis of this question, with three main contributions. First, we prove that for instance-dependent noise, any algorithm that is consistent for classification on the noisy distribution is also consistent on the clean distribution. Second, we prove that for a broad class of instance- and label-dependent noise, a similar consistency result holds for the area under the ROC curve. Third, for the latter noise model, when the noise-free class-probability function belongs to the generalised linear model family, we show that the Isotron can efficiently and provably learn from the corrupted sample.
Submitted 4 May, 2016; v1 submitted 3 May, 2016;
originally announced May 2016.
-
Optimal Decision-Theoretic Classification Using Non-Decomposable Performance Metrics
Authors:
Nagarajan Natarajan,
Oluwasanmi Koyejo,
Pradeep Ravikumar,
Inderjit S. Dhillon
Abstract:
We provide a general theoretical analysis of expected out-of-sample utility, also referred to as decision-theoretic classification, for non-decomposable binary classification metrics such as F-measure and Jaccard coefficient. Our key result is that the expected out-of-sample utility for many performance metrics is provably optimized by a classifier which is equivalent to a signed thresholding of the conditional probability of the positive class. Our analysis bridges a gap in the literature on binary classification, revealed in light of recent results for non-decomposable metrics in population utility maximization style classification. Our results identify checkable properties of a performance metric which are sufficient to guarantee a probability ranking principle. We propose consistent estimators for optimal expected out-of-sample classification. As a consequence of the probability ranking principle, computational requirements can be reduced from exponential to cubic complexity in the general case, and further reduced to quadratic complexity in special cases. We provide empirical results on simulated and benchmark datasets evaluating the performance of the proposed algorithms for decision-theoretic classification and comparing them to baseline and state-of-the-art methods in population utility maximization for non-decomposable metrics.
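As a rough illustration of why a probability ranking principle shrinks the search space, the sketch below sorts instances by an estimated positive-class probability and tries only the n+1 "top-k" labelings instead of all 2^n. It is not the estimator proposed in the paper: the function names are hypothetical, and each candidate is scored against known labels (as on a validation set) rather than by expected utility.

```python
def f1_score(y_true, y_pred):
    """Plain F1 for 0/1 label lists; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def best_top_k(probs, y_true):
    """Try the n+1 labelings 'mark the k most probable as positive'.

    Hypothetical helper: returns (k, f1) for the best k, preferring the
    smallest k on ties.
    """
    n = len(probs)
    # Indices sorted by decreasing estimated probability.
    order = sorted(range(n), key=lambda i: -probs[i])
    best_k, best_f1 = 0, -1.0
    for k in range(n + 1):
        y_pred = [0] * n
        for i in order[:k]:
            y_pred[i] = 1
        score = f1_score(y_true, y_pred)
        if score > best_f1:
            best_k, best_f1 = k, score
    return best_k, best_f1


if __name__ == "__main__":
    print(best_top_k([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

On this toy input the two most probable instances are selected, which is equivalent to thresholding the probabilities anywhere between 0.3 and 0.8.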
Submitted 7 May, 2015;
originally announced May 2015.
-
PU Learning for Matrix Completion
Authors:
Cho-Jui Hsieh,
Nagarajan Natarajan,
Inderjit S. Dhillon
Abstract:
In this paper, we consider the matrix completion problem when the observations are one-bit measurements of some underlying matrix M, and in particular the observed samples consist only of ones and no zeros. This problem is motivated by modern applications such as recommender systems and social networks where only "likes" or "friendships" are observed. The problem of learning from only positive and unlabeled examples, called PU (positive-unlabeled) learning, has been studied in the context of binary classification. We consider the PU matrix completion problem, where an underlying real-valued matrix M is first quantized to generate one-bit observations and then a subset of positive entries is revealed. Under the assumption that M has bounded nuclear norm, we provide recovery guarantees for two different observation models: 1) M parameterizes a distribution that generates a binary matrix, 2) M is thresholded to obtain a binary matrix. For the first case, we propose a "shifted matrix completion" method that recovers M using only a subset of indices corresponding to ones, while for the second case, we propose a "biased matrix completion" method that recovers the (thresholded) binary matrix. Both methods yield strong error bounds: if M is n by n, the Frobenius error is bounded as $O(1/((1-\rho)n))$, where $1-\rho$ denotes the fraction of ones observed. This implies a sample complexity of $O(n \log n)$ ones to achieve a small error, when M is dense and n is large. We extend our methods and guarantees to the inductive matrix completion problem, where rows and columns of M have associated features. We provide efficient and scalable optimization procedures for both the methods and demonstrate the effectiveness of the proposed methods for link prediction (on real-world networks consisting of over 2 million nodes and 90 million links) and semi-supervised clustering tasks.
Submitted 21 November, 2014;
originally announced November 2014.
-
Prediction and Clustering in Signed Networks: A Local to Global Perspective
Authors:
Kai-Yang Chiang,
Cho-Jui Hsieh,
Nagarajan Natarajan,
Ambuj Tewari,
Inderjit S. Dhillon
Abstract:
The study of social networks is a burgeoning research area. However, most existing work deals with networks that simply encode whether relationships exist or not. In contrast, relationships in signed networks can be positive ("like", "trust") or negative ("dislike", "distrust"). The theory of social balance shows that signed networks tend to conform to some local patterns that, in turn, induce certain global characteristics. In this paper, we exploit both local as well as global aspects of social balance theory for two fundamental problems in the analysis of signed networks: sign prediction and clustering. Motivated by local patterns of social balance, we first propose two families of sign prediction methods: measures of social imbalance (MOIs), and supervised learning using high order cycles (HOCs). These methods predict signs of edges based on triangles and $\ell$-cycles for relatively small values of $\ell$. Interestingly, by examining measures of social imbalance, we show that the classic Katz measure, which is used widely in unsigned link prediction, actually has a balance theoretic interpretation when applied to signed networks. Furthermore, motivated by the global structure of balanced networks, we propose an effective low rank modeling approach for both sign prediction and clustering. For the low rank modeling approach, we provide theoretical performance guarantees via convex relaxations, scale it up to large problem sizes using a matrix factorization based algorithm, and provide extensive experimental validation including comparisons with local approaches. Our experimental results indicate that, by adopting a more global viewpoint of balance structure, we get significant performance and computational gains in prediction and clustering tasks on signed networks. Our work therefore highlights the usefulness of the global aspect of balance theory for the analysis of signed networks.
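The balance-theoretic reading of the Katz measure can be sketched as follows, assuming a signed adjacency matrix A with entries in {+1, -1, 0} and a truncated sum $\sum_{\ell=1}^{L} \beta^{\ell} A^{\ell}$. The function names, the decay `beta`, and the truncation length `max_len` are illustrative assumptions, not values taken from the paper.

```python
def matmul(a, b):
    """Dense product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_katz(adj, beta=0.5, max_len=3):
    """Sum of beta**l * adj**l for l = 1..max_len (assumed truncation)."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # adj ** 1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power = matmul(power, adj)   # next power of adj
        weight *= beta
    return total


def predict_sign(adj, i, j, beta=0.5, max_len=3):
    """Predict the sign of edge (i, j) from the sign of its Katz score."""
    score = truncated_katz(adj, beta, max_len)[i][j]
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Nodes 0 and 1 are friends, 1 and 2 are enemies, edge (0, 2) is unknown.
    adj = [[0, 1, 0],
           [1, 0, -1],
           [0, -1, 0]]
    print(predict_sign(adj, 0, 2))
```

In this toy graph the only short walk from node 0 to node 2 passes through one positive and one negative edge, so the score is negative and the predicted sign agrees with the balanced triangle ("the enemy of my friend is my enemy").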
Submitted 4 March, 2013; v1 submitted 20 February, 2013;
originally announced February 2013.