Sparks of Artificial General Intelligence: Early experiments with GPT-4

Authors: Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

Abstract: Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

Submitted 13 April, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

arXiv:2102.09803 [pdf]

doi 10.1145/3449247

Meeting Effectiveness and Inclusiveness in Remote Collaboration

Authors: Ross Cutler, Yasaman Hosseinkashi, Jamie Pool, Senja Filipi, Robert Aichner, Yuan Tu, Johannes Gehrke

Abstract: A primary goal of remote collaboration tools is to provide effective and inclusive meetings for all participants. To study meeting effectiveness and meeting inclusiveness, we first conducted a large-scale email survey (N=4,425; after filtering N=3,290) at a large technology company (pre-COVID-19); using this data we derived a multivariate model of meeting effectiveness and show how it correlates with meeting inclusiveness, participation, and feeling comfortable to contribute. We believe this is the first such model of meeting effectiveness and inclusiveness. The large size of the data provided the opportunity to analyze correlations that are specific to sub-populations such as the impact of video. The model shows the following factors are correlated with inclusiveness, effectiveness, participation, and feeling comfortable to contribute in meetings: sending a pre-meeting communication, sending a post-meeting summary, including a meeting agenda, attendee location, remote-only meeting, audio/video quality and reliability, video usage, and meeting size. The model and survey results give a quantitative understanding of how and where to improve meeting effectiveness and inclusiveness and what the potential returns are. Motivated by the email survey results, we implemented a post-meeting survey into a leading computer-mediated communication (CMC) system to directly measure meeting effectiveness and inclusiveness (during COVID-19). Using initial results based on internal flighting we created a similar model of effectiveness and inclusiveness, with many of the same findings as the email survey. This shows a method of measuring and understanding these metrics that is both practical and useful in a commercial CMC system.

Submitted 19 February, 2021; originally announced February 2021.

arXiv:2011.12715 [pdf, other]

Resonance: Replacing Software Constants with Context-Aware Models in Real-time Communication

Authors: Jayant Gupchup, Ashkan Aazami, Yaran Fan, Senja Filipi, Tom Finley, Scott Inglis, Marcus Asteborg, Luke Caroll, Rajan Chari, Markus Cozowicz, Vishak Gopal, Vinod Prakash, Sasikanth Bendapudi, Jack Gerrits, Eric Lau, Huazhou Liu, Marco Rossi, Dima Slobodianyk, Dmitri Birjukov, Matty Cooper, Nilesh Javar, Dmitriy Perednya, Sriram Srinivasan, John Langford, Ross Cutler, et al. (1 additional author not shown)

Abstract: Large software systems tune hundreds of 'constants' to optimize their runtime performance. These values are commonly derived through intuition, lab tests, or A/B tests. A 'one-size-fits-all' approach is often sub-optimal as the best value depends on runtime context. In this paper, we provide an experimental approach to replace constants with learned contextual functions for Skype - a widely used real-time communication (RTC) application. We present Resonance, a system based on contextual bandits (CB). We describe experiences from three real-world experiments: applying it to the audio, video, and transport components in Skype. We surface a unique and practical challenge of performing machine learning (ML) inference in large software systems written using encapsulation principles. Finally, we open-source FeatureBroker, a library to reduce the friction in adopting ML models in such development environments.

Submitted 22 November, 2020; originally announced November 2020.

Comments: Workshop on ML for Systems at NeurIPS 2020, Accepted

Journal ref: ML for Systems, NeurIPS 2020

arXiv:2007.06835 [pdf, other]

Programming by Rewards

Authors: Nagarajan Natarajan, Ajaykrishna Karthikeyan, Prateek Jain, Ivan Radicek, Sriram Rajamani, Sumit Gulwani, Johannes Gehrke

Abstract: We formalize and study "programming by rewards" (PBR), a new approach for specifying and synthesizing subroutines for optimizing some quantitative metric such as performance, resource utilization, or correctness over a benchmark. A PBR specification consists of (1) input features $x$, and (2) a reward function $r$, modeled as a black-box component (which we can only run), that assigns a reward for each execution. The goal of the synthesizer is to synthesize a "decision function" $f$ which transforms the features to a decision value for the black-box component so as to maximize the expected reward $E[r \circ f (x)]$ for executing decisions $f(x)$ for various values of $x$. We consider a space of decision functions in a DSL of loop-free if-then-else programs, which can branch on linear functions of the input features in a tree-structure and compute a linear function of the inputs in the leaves of the tree. We find that this DSL captures decision functions that are manually written in practice by programmers. Our technical contribution is the use of continuous-optimization techniques to perform synthesis of such decision functions as if-then-else programs. We also show that the framework is theoretically founded: in cases when the rewards satisfy nice properties, the synthesized code is optimal in a precise sense. We have leveraged PBR to synthesize non-trivial decision functions related to search and ranking heuristics in the PROSE codebase (an industrial strength program synthesis framework) and achieve competitive results to manually written procedures over multiple man years of tuning. We present empirical evaluation against other baseline techniques over real-world case studies (including PROSE) as well as on simple synthetic benchmarks.

Submitted 14 July, 2020; originally announced July 2020.
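
The PBR setup described in the abstract above can be illustrated with a minimal sketch. It is not the paper's synthesizer: the abstract says the authors use continuous-optimization techniques, whereas this sketch swaps in plain random-search hill climbing to keep the example short. The names `decision`, `expected_reward`, `toy_reward`, and `random_search` are hypothetical, the reward is a made-up black box that receives both the feature and the decision, and the program has a single branch on one scalar feature.

```python
import random


def decision(params, x):
    """Loop-free if-then-else program: branch on a linear function of x,
    then return a linear function of x from the chosen leaf."""
    w, b, a_then, c_then, a_else, c_else = params
    if w * x + b > 0:
        return a_then * x + c_then
    return a_else * x + c_else


def expected_reward(params, reward, xs):
    """Monte Carlo estimate of the expected reward over sampled inputs xs."""
    return sum(reward(x, decision(params, x)) for x in xs) / len(xs)


def toy_reward(x, d):
    """Hypothetical black box: the best decision is 3 when x >= 0.5, else 1."""
    target = 3.0 if x >= 0.5 else 1.0
    return -(d - target) ** 2


def random_search(reward, xs, iters=2000, step=0.5, seed=0):
    """Hill-climb over the six program parameters using only reward calls.

    This is an assumed stand-in optimizer, not the method from the paper.
    """
    rng = random.Random(seed)
    best = [0.0] * 6
    best_score = expected_reward(best, reward, xs)
    for _ in range(iters):
        cand = [p + rng.gauss(0.0, step) for p in best]
        score = expected_reward(cand, reward, xs)
        if score > best_score:  # keep a candidate only if it improves
            best, best_score = cand, score
    return best, best_score


if __name__ == "__main__":
    xs = [i / 20 for i in range(21)]
    params, score = random_search(toy_reward, xs)
    print("parameters:", params)
    print("estimated expected reward:", score)
```

The search only ever calls the reward function, which mirrors the abstract's constraint that the reward is a black box that can be run but not inspected.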

arXiv:2006.12793 [pdf, other]

Lumos: A Library for Diagnosing Metric Regressions in Web-Scale Applications

Authors: Jamie Pool, Ebrahim Beyrami, Vishak Gopal, Ashkan Aazami, Jayant Gupchup, Jeff Rowland, Binlong Li, Pritesh Kanani, Ross Cutler, Johannes Gehrke

Abstract: Web-scale applications can ship code on a daily to weekly cadence. These applications rely on online metrics to monitor the health of new releases. Regressions in metric values need to be detected and diagnosed as early as possible to reduce the disruption to users and product owners. Regressions in metrics can surface due to a variety of reasons: genuine product regressions, changes in user population, and bias due to telemetry loss (or processing) are among the common causes. Diagnosing the cause of these metric regressions is costly for engineering teams as they need to invest time in finding the root cause of the issue as soon as possible. We present Lumos, a Python library built using the principles of AB testing to systematically diagnose metric regressions to automate such analysis. Lumos has been deployed across the component teams in Microsoft's Real-Time Communication applications Skype and Microsoft Teams. It has enabled engineering teams to detect 100s of real changes in metrics and reject 1000s of false alarms detected by anomaly detectors. The application of Lumos has resulted in freeing up as much as 95% of the time allocated to metric-based investigations. In this work, we open source Lumos and present our results from applying it to two different components within the RTC group over millions of sessions. This general library can be coupled with any production system to manage the volume of alerting efficiently.

Submitted 23 June, 2020; originally announced June 2020.

arXiv:2005.13981 [pdf]

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results

Authors: Chandan K. A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke

Abstract: The INTERSPEECH 2020 Deep Noise Suppression (DNS) Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement aimed to maximize the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluate the noise suppression methods is to use objective metrics on the test set obtained by splitting the original dataset. While the performance is good on the synthetic test set, often the model performance degrades significantly on real recordings. Also, most of the conventional objective metrics do not correlate well with subjective tests and lab subjective tests are not scalable for a large test set. In this challenge, we open-sourced a large clean speech and noise corpus for training the noise suppression models and a representative test set to real-world scenarios consisting of both synthetic and real recordings. We also open-sourced an online subjective test framework based on ITU-T P.808 for researchers to reliably test their developments. We evaluated the results using P.808 on a blind test set. The results and the key learnings from the challenge are discussed. The datasets and scripts can be found at https://github.com/microsoft/DNS-Challenge.

Submitted 18 October, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

Comments: Interspeech 2020. arXiv admin note: substantial text overlap with arXiv:2001.08662

arXiv:2004.10898 [pdf, other]

doi 10.1145/3318464.3389770

Qd-tree: Learning Data Layouts for Big Data Analytics

Authors: Zongheng Yang, Badrish Chandramouli, Chi Wang, Johannes Gehrke, Yinan Li, Umar Farooq Minhas, Per-Åke Larson, Donald Kossmann, Rajeev Acharya

Abstract: Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today's systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries. Further, they are unable to exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.

Submitted 22 April, 2020; originally announced April 2020.

Comments: ACM SIGMOD 2020

arXiv:2003.04150 [pdf, other]

Lightweight Inter-transaction Caching with Precise Clocks and Dynamic Self-invalidation

Authors: Pulkit A. Misra, Srihari Radhakrishnan, Jeffrey S. Chase, Johannes Gehrke, Alvin R. Lebeck

Abstract: Distributed, transactional storage systems scale by sharding data across servers. However, workload-induced hotspots result in contention, leading to higher abort rates and performance degradation. We present KAIROS, a transactional key-value storage system that leverages client-side inter-transaction caching and sharded transaction validation to balance the dynamic load and alleviate workload-induced hotspots in the system. KAIROS utilizes precise synchronized clocks to implement self-invalidating leases for cache consistency and avoids the overhead and potential hotspots due to maintaining sharing lists or sending invalidations. Experiments show that inter-transaction caching alone provides 2.35x the throughput of a baseline system with only intra-transaction caching; adding sharded validation further improves the throughput by a factor of 3.1 over baseline. We also show that lease-based caching can operate at a 30% higher scale while providing 1.46x the throughput of the state-of-the-art explicit invalidation-based caching.

Submitted 9 March, 2020; originally announced March 2020.
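
The idea of a self-invalidating lease mentioned in the abstract above can be sketched in a few lines. This is only an illustration of the general technique, not KAIROS itself: the class names `Lease` and `LeaseCache` are hypothetical, the clock is passed in explicitly as a stand-in for precise synchronized clocks, and transactions, validation, and sharding are all left out.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Lease:
    value: Any
    expires_at: float  # absolute time after which the entry must not be served


class LeaseCache:
    """Client-side cache whose entries expire on their own.

    No server-side sharing list and no invalidation message is needed:
    an entry is simply unusable once its lease has run out.
    """

    def __init__(self, lease_seconds: float) -> None:
        self.lease_seconds = lease_seconds
        self._entries: Dict[str, Lease] = {}

    def put(self, key: str, value: Any, now: float) -> None:
        # Grant a lease that lasts lease_seconds from the given time.
        self._entries[key] = Lease(value, now + self.lease_seconds)

    def get(self, key: str, now: float) -> Optional[Any]:
        lease = self._entries.get(key)
        if lease is None:
            return None
        if now >= lease.expires_at:
            # Lease expired: drop the entry so the caller refetches it.
            del self._entries[key]
            return None
        return lease.value


if __name__ == "__main__":
    cache = LeaseCache(lease_seconds=5.0)
    cache.put("k", "v", now=100.0)
    print(cache.get("k", now=104.0))  # still inside the lease
    print(cache.get("k", now=105.0))  # lease has expired
```

Passing `now` as a parameter keeps the sketch deterministic; a real system would read it from a synchronized clock source.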

arXiv:2001.08662 [pdf]

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework

Authors: Chandan K. A. Reddy, Ebrahim Beyrami, Harishchandra Dubey, Vishak Gopal, Roger Cheng, Ross Cutler, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, Puneet Rana, Sriram Srinivasan, Johannes Gehrke

Abstract: The INTERSPEECH 2020 Deep Noise Suppression Challenge is intended to promote collaborative research in real-time single-channel Speech Enhancement aimed to maximize the subjective (perceptual) quality of the enhanced speech. A typical approach to evaluate the noise suppression methods is to use objective metrics on the test set obtained by splitting the original dataset. Many publications report reasonable performance on the synthetic test set drawn from the same distribution as that of the training set. However, often the model performance degrades significantly on real recordings. Also, most of the conventional objective metrics do not correlate well with subjective tests and lab subjective tests are not scalable for a large test set. In this challenge, we open-source a large clean speech and noise corpus for training the noise suppression models and a representative test set to real-world scenarios consisting of both synthetic and real recordings. We also open source an online subjective test framework based on ITU-T P.808 for researchers to quickly test their developments. The winners of this challenge will be selected based on subjective evaluation on a representative test set using P.808 framework.

Submitted 19 April, 2020; v1 submitted 23 January, 2020; originally announced January 2020.

Comments: Details about Deep Noise Suppression Challenge

arXiv:1912.02222 [pdf, other]

Reinforcement learning for bandwidth estimation and congestion control in real-time communications

Authors: Joyce Fang, Martin Ellis, Bin Li, Siyao Liu, Yasaman Hosseinkashi, Michael Revow, Albert Sadovnikov, Ziyuan Liu, Peng Cheng, Sachin Ashok, David Zhao, Ross Cutler, Yan Lu, Johannes Gehrke

Abstract: Bandwidth estimation and congestion control for real-time communications (i.e., audio and video conferencing) remains a difficult problem, despite many years of research. Achieving high quality of experience (QoE) for end users requires continual updates due to changing network architectures and technologies. In this paper, we apply reinforcement learning for the first time to the problem of real-time communications (RTC), where we seek to optimize user-perceived quality. We present initial proof-of-concept results, where we learn an agent to control sending rate in an RTC system, evaluating using both network simulation and real Internet video calls. We discuss the challenges we observed, particularly in designing realistic reward functions that reflect QoE, and in bridging the gap between the training environment and real-world networks.

Submitted 4 December, 2019; originally announced December 2019.

Comments: Workshop on ML for Systems at NeurIPS 2019

arXiv:1912.00580 [pdf, other]

Multi-version Indexing in Flash-based Key-Value Stores

Authors: Pulkit A. Misra, Jeffrey S. Chase, Johannes Gehrke, Alvin R. Lebeck

Abstract: Maintaining multiple versions of data is popular in key-value stores since it increases concurrency and improves performance. However, designing a multi-version key-value store entails several challenges, such as additional capacity for storing extra versions and an indexing mechanism for mapping versions of a key to their values. We present SkimpyFTL, an FTL-integrated multi-version key-value store that exploits the remap-on-write property of flash-based SSDs for multi-versioning and provides a tradeoff between memory capacity and lookup latency for indexing.

Submitted 2 December, 2019; originally announced December 2019.

Comments: 7 pages, 6 figures

arXiv:1909.08050 [pdf]

A scalable noisy speech dataset and online subjective test framework

Authors: Chandan K. A. Reddy, Ebrahim Beyrami, Jamie Pool, Ross Cutler, Sriram Srinivasan, Johannes Gehrke

Abstract: Background noise is a major source of quality impairments in Voice over Internet Protocol (VoIP) and Public Switched Telephone Network (PSTN) calls. Recent work shows the efficacy of deep learning for noise suppression, but the datasets have been relatively small compared to those used in other domains (e.g., ImageNet) and the associated evaluations have been more focused. In order to better facilitate deep learning research in Speech Enhancement, we present a noisy speech dataset (MS-SNSD) that can scale to arbitrary sizes depending on the number of speakers, noise types, and Speech to Noise Ratio (SNR) levels desired. We show that increasing dataset sizes increases noise suppression performance as expected. In addition, we provide an open-source evaluation methodology to evaluate the results subjectively at scale using crowdsourcing, with a reference algorithm to normalize the results. To demonstrate the dataset and evaluation framework we apply it to several noise suppressors and compare the subjective Mean Opinion Score (MOS) with objective quality measures such as SNR, PESQ, POLQA, and VISQOL and show why MOS is still required. Our subjective MOS evaluation is the first large scale evaluation of Speech Enhancement algorithms that we are aware of.

Submitted 17 September, 2019; originally announced September 2019.

Comments: InterSpeech 2019

arXiv:1907.01742 [pdf]

Supervised Classifiers for Audio Impairments with Noisy Labels

Authors: Chandan K A Reddy, Ross Cutler, Johannes Gehrke

Abstract: Voice-over-Internet-Protocol (VoIP) calls are prone to various speech impairments due to environmental and network conditions resulting in bad user experience. A reliable audio impairment classifier helps to identify the cause for bad audio quality. The user feedback after the call can act as the ground truth labels for training a supervised classifier on a large audio dataset. However, the labels are noisy as most of the users lack the expertise to precisely articulate the impairment in the perceived speech. In this paper, we analyze the effects of massive noise in labels in training dense networks and Convolutional Neural Networks (CNN) using engineered features, spectrograms and raw audio samples as inputs. We demonstrate that CNN can generalize better on the training data with a large number of noisy labels and gives remarkably higher test performance. The classifiers were trained both on randomly generated label noise and the label noise introduced by human errors. We also show that training with noisy labels requires a significant increase in the training dataset size, which is in proportion to the amount of noise in the labels.

Submitted 3 July, 2019; originally announced July 2019.

Comments: To appear in INTERSPEECH 2019

arXiv:1905.08898 [pdf, other]

doi 10.1145/3318464.3389711

ALEX: An Updatable Adaptive Learned Index

Authors: Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska

Abstract: Recent work on "learned indexes" has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as "models" that predict the position of a key in a dataset. Indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+Tree by a factor of up to three in search time and by an order of magnitude in memory footprint. However, it is limited to static, read-only workloads. In this paper, we present a new learned index called ALEX which addresses practical issues that arise when implementing learned indexes for workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. ALEX effectively combines the core insights from learned indexes with proven storage and indexing techniques to achieve high performance and low memory footprint. On read-only workloads, ALEX beats the learned index from Kraska et al. by up to 2.2X on performance with up to 15X smaller index size. Across the spectrum of read-write workloads, ALEX beats B+Trees by up to 4.1X while never performing worse, with up to 2000X smaller index size. We believe ALEX presents a key step towards making learned indexes practical for a broader class of database workloads with dynamic updates.

Submitted 20 May, 2020; v1 submitted 21 May, 2019; originally announced May 2019.

Report number: MSR-TR-2020-12
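
The key idea quoted in the abstract above, an index as a model that predicts the position of a key, can be sketched for the static, read-only case. This is not ALEX and does not handle inserts, updates, or deletes. It is a minimal illustration under stated assumptions: a single least-squares line over a sorted in-memory list, numeric keys, and a hypothetical class name `LinearLearnedIndex`. The model's worst-case prediction error is recorded at build time so that each lookup only searches a small window around the predicted position.

```python
import bisect


class LinearLearnedIndex:
    """Static learned index: one linear model plus a bounded local search."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        # Least-squares fit of position as a linear function of the key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case distance between predicted and true position.
        self.max_err = max(
            abs(i - self._predict(k)) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return a position of key in the sorted array, or None if absent."""
        n = len(self.keys)
        p = self._predict(key)
        # Clamp the search window to valid bounds of the array.
        lo = min(max(0, p - self.max_err), n)
        hi = max(lo, min(n, p + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


if __name__ == "__main__":
    index = LinearLearnedIndex([10, 20, 30, 45, 70, 100])
    print(index.lookup(45), index.lookup(46))
```

Because every stored key lies within `max_err` positions of its prediction, the bounded binary search is guaranteed to find it; the more linear the key distribution, the smaller that window becomes.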

arXiv:1905.06425 [pdf, other]

An Empirical Analysis of Deep Learning for Cardinality Estimation

Authors: Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, S. Sathiya Keerthi

Abstract: We implement and evaluate deep learning for cardinality estimation by studying the accuracy, space and time trade-offs across several architectures. We find that simple deep learning models can learn cardinality estimations across a variety of datasets (reducing the error by 72% - 98% on average compared to PostgreSQL). In addition, we empirically evaluate the impact of injecting cardinality estimates produced by deep learning models into the PostgreSQL optimizer. In many cases, the estimates from these models lead to better query plans across all datasets, reducing the runtimes by up to 49% on select-project-join workloads. As promising as these models are, we also discuss and address some of the challenges of using them in practice.

Submitted 11 September, 2019; v1 submitted 15 May, 2019; originally announced May 2019.

arXiv:1903.06908 [pdf, other]

Non-intrusive speech quality assessment using neural networks

Authors: Anderson R. Avila, Hannes Gamper, Chandan Reddy, Ross Cutler, Ivan Tashev, Johannes Gehrke

Abstract: Estimating the perceived quality of an audio signal is critical for many multimedia and audio processing systems. Providers strive to offer optimal and reliable services in order to increase the user quality of experience (QoE). In this work, we present an investigation of the applicability of neural networks for non-intrusive audio quality assessment. We propose three neural network-based approaches for mean opinion score (MOS) estimation. We compare our results to three instrumental measures: the perceptual evaluation of speech quality (PESQ), the ITU-T Recommendation P.563, and the speech-to-reverberation energy ratio. Our evaluation uses a speech dataset contaminated with convolutive and additive noise, labeled using a crowd-based QoE evaluation, evaluated with Pearson correlation with MOS labels, and mean-squared-error of the estimated MOS. Our proposed approaches outperform the aforementioned instrumental measures, with a fully connected deep neural network using Mel-frequency features providing the best correlation (0.87) and the lowest mean squared error (0.15).

Submitted 16 March, 2019; originally announced March 2019.

Comments: Accepted at ICASSP 2019

arXiv:1803.08604 [pdf, other]

Learning State Representations for Query Optimization with Deep Reinforcement Learning

Authors: Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, S. Sathiya Keerthi

Abstract: Deep reinforcement learning is quickly changing the field of artificial intelligence. These models are able to capture a high-level understanding of their environment, enabling them to learn difficult dynamic tasks in a variety of domains. In the database field, query optimization remains a difficult problem. Our goal in this work is to explore the capabilities of deep reinforcement learning in the context of query optimization. At each state, we build queries incrementally and encode properties of subqueries through a learned representation. The challenge here lies in the formation of the state transition function, which defines how the current subquery state combines with the next query operation (action) to yield the next state. As a first step in this direction, we focus on the state representation problem and the formation of the state transition function. We describe our approach and show preliminary results. We further discuss how we can use the state representation to improve query optimization using reinforcement learning.

Submitted 22 March, 2018; originally announced March 2018.

arXiv:1803.04562 [pdf, other]

Bias in OLAP Queries: Detection, Explanation, and Removal

Authors: Babak Salimi, Johannes Gehrke, Dan Suciu

Abstract: Online analytical processing (OLAP) is an essential element of decision-support systems. OLAP tools provide insights and understanding needed for improved decision making. However, the answers to OLAP queries can be biased and lead to perplexing and incorrect insights. In this paper, we propose HypDB, a system to detect, explain, and resolve bias in decision-support queries. We give a simple definition of a biased query, which performs a set of independence tests on the data to detect bias. We propose a novel technique that gives explanations for bias, thus assisting an analyst in understanding what goes on. Additionally, we develop an automated method for rewriting a biased query into an unbiased query, which shows what the analyst intended to examine. In a thorough evaluation on several real datasets we show both the quality and the performance of our techniques, including the completely automatic discovery of the revolutionary insights from a famous 1973 discrimination case.

Submitted 24 July, 2018; v1 submitted 12 March, 2018; originally announced March 2018.

Comments: This paper is an extended version of a paper presented at SIGMOD 2018

arXiv:1802.09180 [pdf, other]

Cuttlefish: A Lightweight Primitive for Adaptive Query Processing

Authors: Tomer Kaftan, Magdalena Balazinska, Alvin Cheung, Johannes Gehrke

Abstract: Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefish, a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques. We prototype Cuttlefish in Apache Spark and adaptively choose operators for image convolution, regular expression matching, and relational joins. Our experiments show Cuttlefish-based adaptive convolution and regular expression operators can reach 72-99% of the throughput of an all-knowing oracle that always selects the optimal algorithm, even when individual physical operators are up to 105x slower than the optimal. Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x compared with Spark SQL's query optimizer.

Submitted 26 February, 2018; originally announced February 2018.

arXiv:1508.05347 [pdf, ps, other]

Pricing Queries Approximately Optimally

Authors: Vasilis Syrgkanis, Johannes Gehrke

Abstract: Data as a commodity has always been purchased and sold. Recently, web services that are data marketplaces have emerged that match data buyers with data sellers. So far there are no guidelines on how to price queries against a database. We consider the recently proposed query-based pricing framework of Koutris et al. and ask the question of computing optimal input prices in this framework by formulating a buyer utility model. We establish the interesting and deep equivalence between arbitrage-freeness in the query-pricing framework and envy-freeness in pricing theory for appropriately chosen buyer valuations. Given the approximation hardness results from envy-free pricing we then develop logarithmic approximation pricing algorithms exploiting the max flow interpretation of the arbitrage-free pricing for the restricted query language proposed by Koutris et al. We propose a novel polynomial-time logarithmic approximation pricing scheme and show that our new scheme performs better than the existing envy-free pricing algorithms instance-by-instance. We also present a faster pricing algorithm that is always greater than the existing solutions, but worse than our previous scheme. We experimentally show how our pricing algorithms perform with respect to the existing envy-free pricing algorithms and to the optimal exponentially computable solution, and our experiments show that our approximation algorithms consistently arrive at about 99% of the optimal.

Submitted 25 August, 2015; v1 submitted 21 August, 2015; originally announced August 2015.

arXiv:1412.7641 [pdf, ps, other]

Balancing Isolation and Sharing of Data for Third-Party Extensible App Ecosystems

Authors: Florian Schröder, Raphael M. Reischuk, Johannes Gehrke

Abstract: In the landscape of application ecosystems, today's cloud users wish to personalize not only their browsers with various extensions or their smartphones with various applications, but also the various extensions and applications themselves. The resulting personalization significantly raises the attractiveness for typical Web 2.0 users, but gives rise to various security risks and privacy concerns, such as unforeseen access to certain critical components, undesired information flow of personal information to untrusted applications, or emerging attack surfaces that were not possible before a personalization has taken place. In this paper, we propose a novel extensibility mechanism which is used for implementing personalization of existing cloud applications towards (possibly untrusted) components in a secure and privacy-friendly manner. Our model provides a clean component abstraction, thereby in particular ruling out undesired component accesses and ensuring that no undesired information flow takes place between application components -- either trusted from the base application or untrusted from various extensions. We then instantiate our model in the SAFE web application framework (WWW 2012), resulting in a novel methodology that is inspired by traditional access control and specifically designed for the newly emerging needs of extensibility in application ecosystems. We illustrate the convenient usage of our techniques by showing how to securely extend an existing social network application.

Submitted 10 April, 2015; v1 submitted 24 December, 2014; originally announced December 2014.

arXiv:1407.4729 [pdf, other]

Sparse Partially Linear Additive Models

Authors: Yin Lou, Jacob Bien, Rich Caruana, Johannes Gehrke

Abstract: The generalized partially linear additive model (GPLAM) is a flexible and interpretable approach to building predictive models. It combines features in an additive manner, allowing each to have either a linear or nonlinear effect on the response. However, the choice of which features to treat as linear or nonlinear is typically assumed known. Thus, to make a GPLAM a viable approach in situations in which little is known a priori about the features, one must overcome two primary model selection challenges: deciding which features to include in the model and determining which of these features to treat nonlinearly. We introduce the sparse partially linear additive model (SPLAM), which combines model fitting and both of these model selection challenges into a single convex optimization problem. SPLAM provides a bridge between the lasso and sparse additive models. Through a statistical oracle inequality and thorough simulation, we demonstrate that SPLAM can outperform other methods across a broad spectrum of statistical regimes, including the high-dimensional ($p\gg N$) setting. We develop efficient algorithms that are applied to real data sets with half a million samples and over 45,000 features with excellent predictive performance.

Submitted 27 March, 2018; v1 submitted 17 July, 2014; originally announced July 2014.

Comments: Corrected typos

arXiv:1403.2307 [pdf, other]

The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis

Authors: Sudip Roy, Lucja Kot, Gabriel Bender, Bailu Ding, Hossein Hojjat, Christoph Koch, Nate Foster, Johannes Gehrke

Abstract: Datastores today rely on distribution and replication to achieve improved performance and fault-tolerance. But correctness of many applications depends on strong consistency properties - something that can impose substantial overheads, since it requires coordinating the behavior of multiple nodes. This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes. The key insight is to allow the state of the system to be inconsistent during execution, as long as this inconsistency is bounded and does not affect transaction correctness. In contrast to previous work, our approach uses program analysis to extract semantic information about permissible levels of inconsistency and is fully automated. We then employ a novel homeostasis protocol to allow sites to operate independently, without communicating, as long as any inconsistency is governed by appropriate treaties between the nodes. We discuss mechanisms for optimizing treaties based on workload characteristics to minimize communication, as well as a prototype implementation and experiments that demonstrate the benefits of our approach on common transactional benchmarks.

Submitted 19 January, 2015; v1 submitted 10 March, 2014; originally announced March 2014.

arXiv:1311.2276 [pdf, ps, other]

A Quantitative Evaluation Framework for Missing Value Imputation Algorithms

Authors: Vinod Nair, Rahul Kidambi, Sundararajan Sellamanickam, S. Sathiya Keerthi, Johannes Gehrke, Vijay Narayanan

Abstract: We consider the problem of quantitatively evaluating missing value imputation algorithms. Given a dataset with missing values and a choice of several imputation algorithms to fill them in, there is currently no principled way to rank the algorithms using a quantitative metric. We develop a framework based on treating imputation evaluation as a problem of comparing two distributions and show how it can be used to compute quantitative metrics. We present an efficient procedure for applying this framework to practical datasets, demonstrate several metrics derived from the existing literature on comparing distributions, and propose a new metric called Neighborhood-based Dissimilarity Score which is fast to compute and provides similar results. Results are shown on several datasets, metrics, and imputation algorithms.

Submitted 10 November, 2013; originally announced November 2013.

Comments: 9 pages

arXiv:1208.0080 [pdf, other]

The Complexity of Social Coordination

Authors: Konstantinos Mamouras, Sigal Oren, Lior Seeman, Lucja Kot, Johannes Gehrke

Abstract: Coordination is a challenging everyday task; just think of the last time you organized a party or a meeting involving several people. As a growing part of our social and professional life goes online, an opportunity for an improved coordination process arises. Recently, Gupta et al. proposed entangled queries as a declarative abstraction for data-driven coordination, where the difficulty of the coordination task is shifted from the user to the database. Unfortunately, evaluating entangled queries is very hard, and thus previous work considered only a restricted class of queries that satisfy safety (the coordination partners are fixed) and uniqueness (all queries need to be satisfied). In this paper we significantly extend the class of feasible entangled queries beyond uniqueness and safety. First, we show that we can simply drop uniqueness and still efficiently evaluate a set of safe entangled queries. Second, we show that as long as all users coordinate on the same set of attributes, we can give an efficient algorithm for coordination even if the set of queries does not satisfy safety. In an experimental evaluation we show that our algorithms are feasible for a wide spectrum of coordination scenarios.

Submitted 31 July, 2012; originally announced August 2012.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 11, pp. 1172-1183 (2012)

arXiv:1109.5111 [pdf, other]

Nerio: Leader Election and Edict Ordering

Authors: Robbert van Renesse, Fred B. Schneider, Johannes Gehrke

Abstract: Coordination in a distributed system is facilitated if there is a unique process, the leader, to manage the other processes. The leader creates edicts and sends them to other processes for execution or forwarding to other processes. The leader may fail, and when this occurs a leader election protocol selects a replacement. This paper describes Nerio, a class of such leader election protocols.

Submitted 26 September, 2011; v1 submitted 23 September, 2011; originally announced September 2011.

arXiv:1005.3773 [pdf, other]

Behavioral Simulations in MapReduce

Authors: Guozhang Wang, Marcos Vaz Salles, Benjamin Sowell, Xun Wang, Tuan Cao, Alan Demers, Johannes Gehrke, Walker White

Abstract: In many scientific domains, researchers are turning to large-scale behavioral simulations to better understand important real-world phenomena. While there has been a great deal of work on simulation tools from the high-performance computing community, behavioral simulations remain challenging to program and automatically scale in parallel environments. In this paper we present BRACE (Big Red Agent-based Computation Engine), which extends the MapReduce framework to process these simulations efficiently across a cluster. We can leverage spatial locality to treat behavioral simulations as iterated spatial joins and greatly reduce the communication between nodes. In our experiments we achieve nearly linear scale-up on several realistic simulations. Though processing behavioral simulations in parallel as iterated spatial joins can be very efficient, it can be much simpler for the domain scientists to program the behavior of a single agent. Furthermore, many simulations include a considerable amount of complex computation and message passing between agents, which makes it important to optimize the performance of a single node and the communication across nodes. To address both of these challenges, BRACE includes a high-level language called BRASIL (the Big Red Agent SImulation Language). BRASIL has object oriented features for programming simulations, but can be compiled to a data-flow representation for automatic parallelization and optimization. We show that by using various optimization techniques, we can achieve both scalability and single-node performance similar to that of a hand-coded simulation.

Submitted 20 May, 2010; originally announced May 2010.

arXiv:0909.5530 [pdf, ps, other]

Differential Privacy via Wavelet Transforms

Authors: Xiaokui Xiao, Guozhang Wang, Johannes Gehrke

Abstract: Privacy preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, $ε$-differential privacy provides one of the strongest privacy guarantees. Existing data publishing methods that achieve $ε$-differential privacy, however, offer little data utility. In particular, if the output dataset is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures $ε$-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution.

Submitted 30 September, 2009; originally announced September 2009.

arXiv:0909.1770 [pdf]

From Declarative Languages to Declarative Processing in Computer Games

Authors: Benjamin Sowell, Alan Demers, Johannes Gehrke, Nitin Gupta, Haoyuan Li, Walker White

Abstract: Recent work has shown that we can dramatically improve the performance of computer games and simulations through declarative processing: Character AI can be written in an imperative scripting language which is then compiled to relational algebra and executed by a special games engine with features similar to a main memory database system. In this paper we lay out a challenging research agenda built on these ideas. We discuss several research ideas for novel language features to support atomic actions and reactive programming. We also explore challenges for main-memory query processing in games and simulations including adaptive query plan selection, support for parallel architectures, debugging simulation scripts, and extensions for multi-player games and virtual worlds. We believe that these research challenges will result in a dramatic change in the design of game engines over the next decade.

Submitted 9 September, 2009; originally announced September 2009.

Comments: CIDR 2009

arXiv:0904.0682 [pdf, ps, other]

Privacy in Search Logs

Authors: Michaela Goetz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao, Johannes Gehrke

Abstract: Search engine companies collect the "database of intentions", the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper we analyze algorithms for publishing frequent keywords, queries and clicks of a search log. We first show how methods that achieve variants of $k$-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by $ε$-differential privacy unfortunately does not provide any utility for this problem. We then propose an algorithm ZEALOUS and show how to set its parameters to achieve $(ε,δ)$-probabilistic privacy. We also contrast our analysis of ZEALOUS with an analysis by Korolova et al. [17] that achieves $(ε',δ')$-indistinguishability. Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves $k$-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to $k$-anonymity while at the same time achieving much stronger privacy guarantees.

Submitted 11 May, 2011; v1 submitted 4 April, 2009; originally announced April 2009.

arXiv:0809.0116 [pdf, ps, other]

doi 10.1109/ICDE.2008.4497432

Toward Expressive and Scalable Sponsored Search Auctions

Authors: David J. Martin, Johannes Gehrke, Joseph Y. Halpern

Abstract: Internet search results are a growing and highly profitable advertising platform. Search providers auction advertising slots to advertisers on their search result pages. Due to the high volume of searches and the users' low tolerance for search result latency, it is imperative to resolve these auctions fast. Current approaches restrict the expressiveness of bids in order to achieve fast winner determination, which is the problem of allocating slots to advertisers so as to maximize the expected revenue given the advertisers' bids. The goal of our work is to permit more expressive bidding, thus allowing advertisers to achieve complex advertising goals, while still providing fast and scalable techniques for winner determination.

Submitted 31 August, 2008; originally announced September 2008.

Comments: 10 pages, 13 figures, ICDE 2008

ACM Class: K.4.4

Journal ref: David J. Martin, Johannes Gehrke, and Joseph Y. Halpern. Toward Expressive and Scalable Sponsored Search Auctions. In Proceedings of the 24th IEEE International Conference on Data Engineering, pages 237--246. April 2008

arXiv:0705.2787 [pdf, ps, other]

Worst-Case Background Knowledge for Privacy-Preserving Data Publishing

Authors: David J. Martin, Daniel Kifer, Ashwin Machanavajjhala, Johannes Gehrke, Joseph Y. Halpern

Abstract: Recent work has shown the necessity of considering an attacker's background knowledge when reasoning about privacy in data publishing. However, in practice, the data publisher does not know what background knowledge the attacker possesses. Thus, it is important to consider the worst case. In this paper, we initiate a formal study of worst-case background knowledge. We propose a language that can express any background knowledge about the data. We provide a polynomial time algorithm to measure the amount of disclosure of sensitive information in the worst case, given that the attacker has at most a specified number of pieces of information in this language. We also provide a method to efficiently sanitize the data so that the amount of disclosure in the worst case is less than a specified threshold.

Submitted 18 May, 2007; originally announced May 2007.

Comments: 10 pages

arXiv:cs/0702012 [pdf]

doi 10.1109/ICDM.2006.126

Plagiarism Detection in arXiv

Authors: Daria Sorokina, Johannes Gehrke, Simeon Warner, Paul Ginsparg

Abstract: We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14-year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.

Submitted 1 February, 2007; originally announced February 2007.

Comments: Sixth International Conference on Data Mining (ICDM'06), Dec 2006

Showing 1–33 of 33 results for author: Gehrke, J