-
What can Data-Centric AI Learn from Data and ML Engineering?
Authors:
Neoklis Polyzotis,
Matei Zaharia
Abstract:
Data-centric AI is a new and exciting research topic in the AI community, but many organizations already build and maintain various "data-centric" applications whose goal is to produce high quality data. These range from traditional business data processing applications (e.g., "how much should we charge each of our customers this month?") to production ML systems such as recommendation engines. The fields of data and ML engineering have arisen in recent years to manage these applications, and both include many interesting novel tools and processes. In this paper, we discuss several lessons from data and ML engineering that could be interesting to apply in data-centric AI, based on our experience building data and ML platforms that serve thousands of applications at a range of organizations.
Submitted 13 December, 2021;
originally announced December 2021.
-
Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities
Authors:
Doris Xin,
Hui Miao,
Aditya Parameswaran,
Neoklis Polyzotis
Abstract:
Machine learning (ML) is now commonplace, powering data-driven applications in various organizations. Unlike the traditional perception of ML in research, ML production pipelines are complex, with many interlocking analytical components beyond training, whose sub-parts are often run multiple times on overlapping subsets of data. However, there is a lack of quantitative evidence regarding the lifespan, architecture, frequency, and complexity of these pipelines to understand how data management research can be used to make them more efficient, effective, robust, and reproducible. To that end, we analyze the provenance graphs of 3000 production ML pipelines at Google, comprising over 450,000 models trained, spanning a period of over four months, in an effort to understand the complexity and challenges underlying production ML. Our analysis reveals the characteristics, components, and topologies of typical industry-strength ML pipelines at various granularities. Along the way, we introduce a specialized data model for representing and reasoning about repeatedly run components in these ML pipelines, which we call model graphlets. We identify several rich opportunities for optimization, leveraging traditional data management ideas. We show how targeting even one of these opportunities, i.e., identifying and pruning wasted computation that does not translate to model deployment, can reduce wasted computation cost by 50% without compromising the model deployment cadence.
Submitted 29 March, 2021;
originally announced March 2021.
-
Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX)
Authors:
Konstantinos Katsiapis,
Abhijit Karmarkar,
Ahmet Altay,
Aleksandr Zaks,
Neoklis Polyzotis,
Anusha Ramesh,
Ben Mathes,
Gautam Vasudevan,
Irene Giannoumis,
Jarek Wilkiewicz,
Jiri Simsa,
Justin Hong,
Mitch Trott,
Noé Lutz,
Pavel A. Dournov,
Robert Crowe,
Sarah Sirajuddin,
Tris Brian Warkentin,
Zhitao Li
Abstract:
Software Engineering, as a discipline, has matured over the past 5+ decades. The modern world heavily depends on it, so the increased maturity of Software Engineering was an eventuality. Practices like testing and reliable technologies help make Software Engineering reliable enough to build industries upon. Meanwhile, Machine Learning (ML) has also grown over the past 2+ decades. ML is used more and more for research, experimentation and production workloads. ML now commonly powers widely-used products integral to our lives. But ML Engineering, as a discipline, has not widely matured as much as its Software Engineering ancestor. Can we take what we have learned and help the nascent field of applied ML evolve into ML Engineering the way Programming evolved into Software Engineering [1]? In this article we will give a whirlwind tour of Sibyl [2] and TensorFlow Extended (TFX) [3], two successive end-to-end (E2E) ML platforms at Alphabet. We will share the lessons learned from over a decade of applied ML built on these platforms, explain both their similarities and their differences, and expand on the shifts (both mental and technical) that helped us on our journey. In addition, we will highlight some of the capabilities of TFX that help realize several aspects of ML Engineering. We argue that in order to unlock the gains ML can bring, organizations should advance the maturity of their ML teams by investing in robust ML infrastructure and promoting ML Engineering education. We also recommend that before focusing on cutting-edge ML modeling techniques, product leaders should invest more time in adopting interoperable ML platforms for their organizations. In closing, we will also share a glimpse into the future of TFX.
Submitted 7 October, 2020; v1 submitted 28 September, 2020;
originally announced October 2020.
-
Improving Differentially Private Models with Active Learning
Authors:
Zhengli Zhao,
Nicolas Papernot,
Sameer Singh,
Neoklis Polyzotis,
Augustus Odena
Abstract:
Broad adoption of machine learning techniques has increased privacy concerns for models trained on sensitive data such as medical records. Existing techniques for training differentially private (DP) models give rigorous privacy guarantees, but applying these techniques to neural networks can severely degrade model performance. This performance reduction is an obstacle to deploying private models in the real world. In this work, we improve the performance of DP models by fine-tuning them through active learning on public data. We introduce two new techniques - DIVERSEPUBLIC and NEARPRIVATE - for doing this fine-tuning in a privacy-aware way. For the MNIST and SVHN datasets, these techniques improve state-of-the-art accuracy for DP models while retaining privacy guarantees.
Submitted 2 October, 2019;
originally announced October 2019.
-
Automated Data Slicing for Model Validation: A Big data - AI Integration Approach
Authors:
Yeounoh Chung,
Tim Kraska,
Neoklis Polyzotis,
Ki Hyun Tae,
Steven Euijong Whang
Abstract:
As machine learning systems become democratized, it becomes increasingly important to help users easily debug their models. However, current data tools are still primitive when it comes to helping users trace model performance problems all the way to the data. We focus on the particular problem of slicing data to identify subsets of the validation data where the model performs poorly. This is an important problem in model validation because the overall model performance can fail to reflect that of the smaller subsets, and slicing allows users to analyze the model performance at a more granular level. Unlike general techniques (e.g., clustering) that can find arbitrary slices, our goal is to find interpretable slices (which are easier to act on than arbitrary subsets) that are problematic and large. We propose Slice Finder, which is an interactive framework for identifying such slices using statistical techniques. Applications include diagnosing model fairness and fraud detection, where identifying slices that are interpretable to humans is crucial. This research is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.
Submitted 6 January, 2019; v1 submitted 16 July, 2018;
originally announced July 2018.
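The slicing idea in the abstract above can be illustrated with a small sketch. This is not Slice Finder itself: the abstract says only that it uses "statistical techniques", so the effect-size score, the function name `find_slices`, and the thresholds `min_size` and `min_effect` below are all assumptions chosen for illustration. Each candidate slice is a single `feature = value` predicate, which keeps it interpretable, and a slice is reported when its mean loss is clearly higher than the mean loss of the remaining rows.

```python
from math import sqrt
from statistics import mean, pvariance


def find_slices(rows, losses, min_size=2, min_effect=0.5):
    """Rank single-predicate slices (feature == value) by how much worse
    the model's loss is inside the slice than outside it.

    rows   : list of dicts mapping feature name -> categorical value
    losses : per-row model loss, aligned with rows
    The score is a simple standardized mean difference (an assumption,
    not necessarily the statistic used in the paper).
    """
    results = []
    for feature in rows[0].keys():
        for value in sorted({r[feature] for r in rows}, key=str):
            inside = [l for r, l in zip(rows, losses) if r[feature] == value]
            outside = [l for r, l in zip(rows, losses) if r[feature] != value]
            # Skip slices that are too small or that cover everything.
            if len(inside) < min_size or not outside:
                continue
            pooled = sqrt((pvariance(inside) + pvariance(outside)) / 2)
            if pooled == 0:
                continue
            effect = (mean(inside) - mean(outside)) / pooled
            if effect >= min_effect:
                results.append((feature, value, len(inside), effect))
    # Most problematic first; larger slices win ties.
    results.sort(key=lambda t: (-t[3], -t[2]))
    return results


# Hypothetical validation data: the model does badly on country == "DE".
rows = [
    {"country": "US", "device": "web"},
    {"country": "US", "device": "app"},
    {"country": "DE", "device": "web"},
    {"country": "DE", "device": "app"},
    {"country": "FR", "device": "web"},
    {"country": "FR", "device": "app"},
]
losses = [0.1, 0.2, 0.9, 1.0, 0.1, 0.2]
slices = find_slices(rows, losses)
```

On this toy data only the `country = DE` slice passes the threshold. A fuller treatment would also consider conjunctions of predicates and test for statistical significance, which this sketch omits.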
-
The Case for Learned Index Structures
Authors:
Tim Kraska,
Alex Beutel,
Ed H. Chi,
Jeffrey Dean,
Neoklis Polyzotis
Abstract:
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that, by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly, though, we believe that the idea of replacing core components of a data management system through learned models has far-reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
Submitted 30 April, 2018; v1 submitted 4 December, 2017;
originally announced December 2017.
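The premise in the abstract above, that an index is a model mapping a key to a position in a sorted array, can be made concrete with a toy example. The sketch below is an illustration only, not the paper's design: it swaps in a single least-squares line instead of neural nets, and the class name `LinearLearnedIndex` and its methods are hypothetical. The model predicts a position, and the largest prediction error seen while building bounds a small window that is then searched exactly.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index over a sorted list of numeric keys.

    A straight line (key -> position) is fitted by least squares. The
    largest prediction error observed at build time bounds the window
    that lookup() has to binary-search.
    """

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        n = len(self.keys)
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos)
                  for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # Worst-case distance between predicted and true position.
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        guess = int(round(self.slope * key + self.intercept))
        # Clamp so the search window always lies inside the array.
        return min(max(guess, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None."""
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 108])
position = index.lookup(23)
missing = index.lookup(4)
```

The better the model fits the key distribution, the smaller `max_err` becomes and the less work the final search does. Storing two floats and one integer instead of a tree is where the memory saving in this toy version comes from.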
-
Towards a Workload for Evolutionary Analytics
Authors:
Jeff LeFevre,
Jagan Sankaranarayanan,
Hakan Hacigumus,
Junichi Tatemura,
Neoklis Polyzotis
Abstract:
Emerging data analysis involves the ingestion and exploration of new data sets, application of complex functions, and frequent query revisions based on observing prior query answers. We call this new type of analysis evolutionary analytics and identify its properties. This type of analysis is not well represented by current benchmark workloads. In this paper, we present a workload and identify several metrics to test system support for evolutionary analytics. Along with our metrics, we present methodologies for running the workload that capture this analytical scenario.
Submitted 27 June, 2013; v1 submitted 5 April, 2013;
originally announced April 2013.
-
RITA: An Index-Tuning Advisor for Replicated Databases
Authors:
Quoc Trung Tran,
Ivo Jimenez,
Rui Wang,
Neoklis Polyzotis,
Anastasia Ailamaki
Abstract:
Given a replicated database, a divergent design tunes the indexes in each replica differently in order to specialize it for a specific subset of the workload. This specialization brings significant performance gains compared to the common practice of having the same indexes in all replicas, but requires the development of new tuning tools for database administrators. In this paper we introduce RITA (Replication-aware Index Tuning Advisor), a novel divergent-tuning advisor that offers several essential features not found in existing tools: it generates robust divergent designs that allow the system to adapt gracefully to replica failures; it computes designs that spread the load evenly among specialized replicas, both during normal operation and when replicas fail; it monitors the workload online in order to detect changes that require a recomputation of the divergent design; and, it offers suggestions to elastically reconfigure the system (by adding/removing replicas or adding/dropping indexes) to respond to workload changes. The key technical innovation behind RITA is showing that the problem of selecting an optimal design can be formulated as a Binary Integer Program (BIP). The BIP has a relatively small number of variables, which makes it feasible to solve it efficiently using any off-the-shelf linear-optimization software. Experimental results demonstrate that RITA computes better divergent designs compared to existing tools, offers more features, and has fast execution times.
Submitted 19 July, 2013; v1 submitted 4 April, 2013;
originally announced April 2013.
-
Exploiting Opportunistic Physical Design in Large-scale Data Analytics
Authors:
Jeff LeFevre,
Jagan Sankaranarayanan,
Hakan Hacigumus,
Junichi Tatemura,
Neoklis Polyzotis,
Michael J. Carey
Abstract:
Large-scale systems, such as MapReduce and Hadoop, perform aggressive materialization of intermediate job results in order to support fault tolerance. When jobs correspond to exploratory queries submitted by data analysts, these materializations yield a large set of materialized views that typically capture common computation among successive queries from the same analyst, or even across queries of different analysts who test similar hypotheses. We propose to treat these views as an opportunistic physical design and use them for the purpose of query optimization. We develop a novel query-rewrite algorithm that addresses the two main challenges in this context: how to search the large space of rewrites, and how to reason about views that contain UDFs (a common feature in large-scale data analytics). The algorithm, which provably finds the minimum-cost rewrite, is inspired by nearest-neighbor searches in non-metric spaces. We present an extensive experimental study on real-world datasets with a prototype data-analytics system based on Hive. The results demonstrate that our approach can result in dramatic performance improvements on complex data-analysis queries, reducing total execution time by an average of 61% and up to two orders of magnitude.
Submitted 10 December, 2013; v1 submitted 26 March, 2013;
originally announced March 2013.
-
Iterative MapReduce for Large Scale Machine Learning
Authors:
Joshua Rosen,
Neoklis Polyzotis,
Vinayak Borkar,
Yingyi Bu,
Michael J. Carey,
Markus Weimer,
Tyson Condie,
Raghu Ramakrishnan
Abstract:
Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one of the foundational disciplines for data analysis, summarization and inference - on Big Data has become routine at most organizations that operate large clouds, usually based on systems such as Hadoop that support the MapReduce programming paradigm. It is now widely recognized that while MapReduce is highly scalable, it suffers from a critical weakness for machine learning: it does not support iteration. Consequently, one has to program around this limitation, leading to fragile, inefficient code. Further, reliance on the programmer is inherently flawed in a multi-tenanted cloud environment, since the programmer does not have visibility into the state of the system when his or her program executes. Prior work has sought to address this problem by either developing specialized systems aimed at stylized applications, or by augmenting MapReduce with ad hoc support for saving state across iterations (driven by an external loop). In this paper, we advocate support for looping as a first-class construct, and propose an extension of the MapReduce programming paradigm called Iterative MapReduce. We then develop an optimizer for a class of Iterative MapReduce programs that cover most machine learning techniques, provide theoretical justifications for the key optimization steps, and empirically demonstrate that system-optimized programs for significant machine learning tasks are competitive with state-of-the-art specialized solutions.
Submitted 13 March, 2013;
originally announced March 2013.
-
Scaling Datalog for Machine Learning on Big Data
Authors:
Yingyi Bu,
Vinayak Borkar,
Michael J. Carey,
Joshua Rosen,
Neoklis Polyzotis,
Tyson Condie,
Markus Weimer,
Raghu Ramakrishnan
Abstract:
In this paper, we present the case for a declarative foundation for data-intensive machine learning systems. Instead of creating a new system for each specific flavor of machine learning task, or hardcoding new optimizations, we argue for the use of recursive queries to program a variety of machine learning systems. By taking this approach, database query optimization techniques can be utilized to identify effective execution plans, and the resulting runtime plans can be executed on a single unified data-parallel query processing engine. As a proof of concept, we consider two programming models from the machine learning domain, Pregel and Iterative Map-Reduce-Update, and show how they can be captured in Datalog, tuned for a specific task, and then compiled into an optimized physical plan. Experiments performed on a large computing cluster with real data demonstrate that this declarative approach can provide very good performance while offering both increased generality and programming ease.
Submitted 2 March, 2012; v1 submitted 1 March, 2012;
originally announced March 2012.
-
CoPhy: A Scalable, Portable, and Interactive Index Advisor for Large Workloads
Authors:
Debabrata Dash,
Neoklis Polyzotis,
Anastasia Ailamaki
Abstract:
Index tuning, i.e., selecting the indexes appropriate for a workload, is a crucial problem in database system tuning. In this paper, we solve index tuning for large problem instances that are common in practice, e.g., thousands of queries in the workload, thousands of candidate indexes and several hard and soft constraints. Our work is the first to reveal that the index tuning problem has a well structured space of solutions, and this space can be explored efficiently with well known techniques from linear optimization. Experimental results demonstrate that our approach outperforms state-of-the-art commercial and research techniques by a significant margin (up to an order of magnitude).
Submitted 16 April, 2011;
originally announced April 2011.
-
Human-Assisted Graph Search: It's Okay to Ask Questions
Authors:
Aditya Parameswaran,
Anish Das Sarma,
Hector Garcia-Molina,
Neoklis Polyzotis,
Jennifer Widom
Abstract:
We consider the problem of human-assisted graph search: given a directed acyclic graph with some (unknown) target node(s), the goal is to find the target node(s) by asking an omniscient human questions of the form "Is there a target node that is reachable from the current node?". This general problem has applications in many domains that can utilize human intelligence, including curation of hierarchies, debugging workflows, image segmentation and categorization, interactive search and filter synthesis. To our knowledge, this work provides the first formal algorithmic study of the optimization of human computation for this problem. We study various dimensions of the problem space, providing algorithms and complexity results. Our framework and algorithms can be used in the design of an optimizer for crowd-sourcing platforms such as Mechanical Turk.
Submitted 16 March, 2011;
originally announced March 2011.
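To make the question model in the abstract above concrete, here is a minimal sketch of a naive top-down strategy. It is not one of the paper's algorithms, and it makes simplifying assumptions: the graph has a single root, there is exactly one target, the target is reachable from the root, and a node counts as reachable from itself. The names `make_oracle` and `find_target` are hypothetical, and the oracle is a simulated stand-in for the human.

```python
def make_oracle(children, target):
    """Simulate the omniscient human: answers whether the target is
    reachable from a given node (a node reaches itself)."""
    def reaches(node):
        stack, seen = [node], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(children.get(current, []))
        return False
    return reaches


def find_target(children, root, oracle):
    """Walk down from the root, asking about each child in turn.

    Descend into the first child that answers yes. If no child answers
    yes, the current node must be the target. Returns the target and
    the number of questions asked.
    """
    questions = 0
    current = root
    while True:
        for child in children.get(current, []):
            questions += 1
            if oracle(child):
                current = child
                break
        else:
            return current, questions


# Hypothetical category hierarchy, as in the "curation of hierarchies" use case.
children = {"vehicle": ["car", "bike"], "car": ["sedan", "suv"]}
found, asked = find_target(children, "vehicle", make_oracle(children, "suv"))
```

Counting questions matters because each one costs human effort. This naive walk can ask about every child at every level, which is the kind of cost an optimizer for question selection would try to reduce.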
-
Semi-Automatic Index Tuning: Keeping DBAs in the Loop
Authors:
Karl Schnaitter,
Neoklis Polyzotis
Abstract:
To obtain good system performance, a DBA must choose a set of indices that is appropriate for the workload. The system can aid in this challenging task by providing recommendations for the index configuration. We propose a new index recommendation technique, termed semi-automatic tuning, that keeps the DBA "in the loop" by generating recommendations that use feedback about the DBA's preferences. The technique also works online, which avoids the limitations of commercial tools that require the workload to be known in advance. The foundation of our approach is the Work Function Algorithm, which can solve a wide variety of online optimization problems with strong competitive guarantees. We present an experimental analysis that validates the benefits of semi-automatic tuning in a wide variety of conditions.
Submitted 30 October, 2011; v1 submitted 8 April, 2010;
originally announced April 2010.