Search | arXiv e-print repository

An Empirical Evaluation of Columnar Storage Formats

Authors: Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, Huanchen Zhang

Abstract: Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both… ▽ More Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats' performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends. △ Less

Submitted 7 November, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: 15 pages; typos corrected, missing figure legend added

arXiv:2010.06760 [pdf, other]

Taurus: Lightweight Parallel Logging for In-Memory Database Management Systems (Extended Version)

Authors: Yu Xia, Xiangyao Yu, Andrew Pavlo, Srinivas Devadas

Abstract: Existing single-stream logging schemes are unsuitable for in-memory database management systems (DBMSs) as the single log is often a performance bottleneck. To overcome this problem, we present Taurus, an efficient parallel logging scheme that uses multiple log streams, and is compatible with both data and command logging. Taurus tracks and encodes transaction dependencies using a vector of log se… ▽ More Existing single-stream logging schemes are unsuitable for in-memory database management systems (DBMSs) as the single log is often a performance bottleneck. To overcome this problem, we present Taurus, an efficient parallel logging scheme that uses multiple log streams, and is compatible with both data and command logging. Taurus tracks and encodes transaction dependencies using a vector of log sequence numbers (LSNs). These vectors ensure that the dependencies are fully captured in logging and correctly enforced in recovery. Our experimental evaluation with an in-memory DBMS shows that Taurus's parallel logging achieves up to 9.9x and 2.9x speedups over single-streamed data logging and command logging, respectively. It also enables the DBMS to recover up to 22.9x and 75.6x faster than these baselines for data and command logging, respectively. We also compare Taurus with two state-of-the-art parallel logging schemes and show that the DBMS achieves up to 2.8x better performance on NVMe drives and 9.2x on HDDs. △ Less

Submitted 13 October, 2020; originally announced October 2020.

arXiv:2004.14471 [pdf, other]

Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats

Authors: Tianyu Li, Matthew Butrovich, Amadou Ngom, Wan Shen Lim, Wes McKinney, Andrew Pavlo

Abstract: The proliferation of modern data processing tools has given rise to open-source columnar data formats. The advantage of these formats is that they help organizations avoid repeatedly converting data to a new format for each application. These formats, however, are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) s… ▽ More The proliferation of modern data processing tools has given rise to open-source columnar data formats. The advantage of these formats is that they help organizations avoid repeatedly converting data to a new format for each application. These formats, however, are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. We aim to reduce or even eliminate this process by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the DB-X DBMS. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods. △ Less

Submitted 29 April, 2020; originally announced April 2020.

Comments: 16 pages

arXiv:2004.09619 [pdf, other]

Vilamb: Low Overhead Asynchronous Redundancy for Direct Access NVM

Authors: Rajat Kateja, Andy Pavlo, Gregory R. Ganger

Abstract: Vilamb provides efficient asynchronous systemredundancy for direct access (DAX) non-volatile memory (NVM) storage. Production storage deployments often use system-redundancy in form of page checksums and cross-page parity. State-of-the-art solutions for maintaining system-redundancy for DAX NVM either incur a high performance overhead or require specialized hardware. The Vilamb user-space library… ▽ More Vilamb provides efficient asynchronous systemredundancy for direct access (DAX) non-volatile memory (NVM) storage. Production storage deployments often use system-redundancy in form of page checksums and cross-page parity. State-of-the-art solutions for maintaining system-redundancy for DAX NVM either incur a high performance overhead or require specialized hardware. The Vilamb user-space library maintains system-redundancy with low overhead by delaying and amortizing the system-redundancy updates over multiple data writes. As a result, Vilamb provides 3--5x the throughput of the state-of-the-art software solution at high operation rates. For applications that need system-redundancy with high performance, and can tolerate some delaying of data redundancy, Vilamb provides a tunable knob between performance and quicker redundancy. Even with the delayed coverage, Vilamb increases the mean time to data loss due to firmware-induced corruptions by up to two orders of magnitude in comparison to maintaining no system-redundancy. △ Less

Submitted 20 April, 2020; originally announced April 2020.

Report number: CMU-PDL-20-101

arXiv:2003.02391 [pdf, other]

Order-Preserving Key Compression for In-Memory Search Trees

Authors: Huanchen Zhang, Xiaoxuan Liu, David G. Andersen, Michael Kaminsky, Kimberly Keeton, Andrew Pavlo

Abstract: We present the High-speed Order-Preserving Encoder (HOPE) for in-memory search trees. HOPE is a fast dictionary-based compressor that encodes arbitrary keys while preserving their order. HOPE's approach is to identify common key patterns at a fine granularity and exploit the entropy to achieve high compression rates with a small dictionary. We first develop a theoretical model to reason about orde… ▽ More We present the High-speed Order-Preserving Encoder (HOPE) for in-memory search trees. HOPE is a fast dictionary-based compressor that encodes arbitrary keys while preserving their order. HOPE's approach is to identify common key patterns at a fine granularity and exploit the entropy to achieve high compression rates with a small dictionary. We first develop a theoretical model to reason about order-preserving dictionary designs. We then select six representative compression schemes using this model and implement them in HOPE. These schemes make different trade-offs between compression rate and encoding speed. We evaluate HOPE on five data structures used in databases: SuRF, ART, HOT, B+tree, and Prefix B+tree. Our experiments show that using HOPE allows the search trees to achieve lower query latency (up to 40\% lower) and better memory efficiency (up to 30\% smaller) simultaneously for most string key workloads. △ Less

Submitted 4 March, 2020; originally announced March 2020.

Comments: SIGMOD'20 version + Appendix

arXiv:1903.02990 [pdf, other]

Scheduling OLTP Transactions via Machine Learning

Authors: Yangjun Sheng, Anthony Tomasic, Tieying Zhang, Andrew Pavlo

Abstract: Current main memory database system architectures are still challenged by high contention workloads and this challenge will continue to grow as the number of cores in processors continues to increase. These systems schedule transactions randomly across cores to maximize concurrency and to produce a uniform load across cores. Scheduling never considers potential conflicts. Performance could be impr… ▽ More Current main memory database system architectures are still challenged by high contention workloads and this challenge will continue to grow as the number of cores in processors continues to increase. These systems schedule transactions randomly across cores to maximize concurrency and to produce a uniform load across cores. Scheduling never considers potential conflicts. Performance could be improved if scheduling balanced between concurrency to maximize throughput and scheduling transactions linearly to avoid conflicts. In this paper, we present the design of several intelligent transaction scheduling algorithms that consider both potential transaction conflicts and concurrency. To incorporate reasoning about transaction conflicts, we develop a supervised machine learning model that estimates the probability of conflict. This model is incorporated into several scheduling algorithms. In addition, we integrate an unsupervised machine learning algorithm into an intelligent scheduling algorithm. We then empirically measure the performance impact of different scheduling algorithms on OLTP and social networking workloads. Our results show that, with appropriate settings, intelligent scheduling can increase throughput by 54% and reduce abort rate by 80% on a 20-core machine, relative to random scheduling. In summary, the paper provides preliminary evidence that intelligent scheduling significantly improves DBMS performance. △ Less

Submitted 29 May, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

arXiv:1901.10938 [pdf, other]

Multi-Tier Buffer Management and Storage System Design for Non-Volatile Memory

Authors: Joy Arulraj, Andy Pavlo, Krishna Teja Malladi

Abstract: The design of the buffer manager in database management systems (DBMSs) is influenced by the performance characteristics of volatile memory (DRAM) and non-volatile storage (e.g., SSD). The key design assumptions have been that the data must be migrated to DRAM for the DBMS to operate on it and that storage is orders of magnitude slower than DRAM. But the arrival of new non-volatile memory (NVM) te… ▽ More The design of the buffer manager in database management systems (DBMSs) is influenced by the performance characteristics of volatile memory (DRAM) and non-volatile storage (e.g., SSD). The key design assumptions have been that the data must be migrated to DRAM for the DBMS to operate on it and that storage is orders of magnitude slower than DRAM. But the arrival of new non-volatile memory (NVM) technologies that are nearly as fast as DRAM invalidates these previous assumptions. This paper presents techniques for managing and designing a multi-tier storage hierarchy comprising of DRAM, NVM, and SSD. Our main technical contributions are a multi-tier buffer manager and a storage system designer that leverage the characteristics of NVM. We propose a set of optimizations for maximizing the utility of data migration between different devices in the storage hierarchy. We demonstrate that these optimizations have to be tailored based on device and workload characteristics. Given this, we present a technique for adapting these optimizations to achieve a near-optimal buffer management policy for an arbitrary workload and storage hierarchy without requiring any manual tuning. We finally present a recommendation system for designing a multi-tier storage hierarchy for a target workload and system cost budget. Our results show that the NVM-aware buffer manager and storage system designer improve throughput and reduce system cost across different transaction and analytical processing workloads. △ Less

Submitted 30 January, 2019; originally announced January 2019.

Comments: 16 pages

arXiv:1901.07064 [pdf, other]

Predictive Indexing

Authors: Joy Arulraj, Ran Xian, Lin Ma, Andrew Pavlo

Abstract: There has been considerable research on automated index tuning in database management systems (DBMSs). But the majority of these solutions tune the index configuration by retrospectively making computationally expensive physical design changes all at once. Such changes degrade the DBMS's performance during the process, and have reduced utility during subsequent query processing due to the delay be… ▽ More There has been considerable research on automated index tuning in database management systems (DBMSs). But the majority of these solutions tune the index configuration by retrospectively making computationally expensive physical design changes all at once. Such changes degrade the DBMS's performance during the process, and have reduced utility during subsequent query processing due to the delay between a workload shift and the associated change. A better approach is to generate small changes that tune the physical design over time, forecast the utility of these changes, and apply them ahead of time to maximize their impact. This paper presents predictive indexing that continuously improves a database's physical design using lightweight physical design changes. It uses a machine learning model to forecast the utility of these changes, and continuously refines the index configuration of the database to handle evolving workloads. We introduce a lightweight hybrid scan operator with which a DBMS can make use of partially-built indexes for query processing. Our evaluation shows that predictive indexing improves the throughput of a DBMS by 3.5--5.2x compared to other state-of-the-art indexing approaches. We demonstrate that predictive indexing works seamlessly with other lightweight automated physical design tuning methods. △ Less

Submitted 21 January, 2019; originally announced January 2019.

Comments: 12 pages

ACM Class: H.2.2; H.2.4

arXiv:1503.01143 [pdf, other]

S-Store: Streaming Meets Transaction Processing

Authors: John Meehan, Nesime Tatbul, Stan Zdonik, Cansu Aslantas, Ugur Cetintemel, Jiang Du, Tim Kraska, Samuel Madden, David Maier, Andrew Pavlo, Michael Stonebraker, Kristin Tufte, Hao Wang

Abstract: Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and… ▽ More Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and streaming applications. We present a simple transaction model for streams that integrates seamlessly with a traditional OLTP system. We chose to build S-Store as an extension of H-Store, an open-source, in-memory, distributed OLTP database system. By implementing S-Store in this way, we can make use of the transaction processing facilities that H-Store already supports, and we can concentrate on the additional implementation features that are needed to support streaming. Similar implementations could be done using other main-memory OLTP platforms. We show that we can actually achieve higher throughput for streaming workloads in S-Store than an equivalent deployment in H-Store alone. We also show how this can be achieved within H-Store with the addition of a modest amount of new functionality. Furthermore, we compare S-Store to two state-of-the-art streaming systems, Spark Streaming and Storm, and show how S-Store matches and sometimes exceeds their performance while providing stronger transactional guarantees. △ Less

Submitted 10 March, 2015; v1 submitted 3 March, 2015; originally announced March 2015.

arXiv:1110.6647 [pdf, other]

On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems

Authors: Andrew Pavlo, Evan P. C. Jones, Stanley Zdonik

Abstract: A new emerging class of parallel database management systems (DBMS) is designed to take advantage of the partitionable workloads of on-line transaction processing (OLTP) applications. Transactions in these systems are optimized to execute to completion on a single node in a shared-nothing cluster without needing to coordinate with other nodes or use expensive concurrency control measures. But some… ▽ More A new emerging class of parallel database management systems (DBMS) is designed to take advantage of the partitionable workloads of on-line transaction processing (OLTP) applications. Transactions in these systems are optimized to execute to completion on a single node in a shared-nothing cluster without needing to coordinate with other nodes or use expensive concurrency control measures. But some OLTP applications cannot be partitioned such that all of their transactions execute within a single-partition in this manner. These distributed transactions access data not stored within their local partitions and subsequently require more heavy-weight concurrency control protocols. Further difficulties arise when the transaction's execution properties, such as the number of partitions it may need to access or whether it will abort, are not known beforehand. The DBMS could mitigate these performance issues if it is provided with additional information about transactions. Thus, in this paper we present a Markov model-based approach for automatically selecting which optimizations a DBMS could use, namely (1) more efficient concurrency control schemes, (2) intelligent scheduling, (3) reduced undo logging, and (4) speculative execution. To evaluate our techniques, we implemented our models and integrated them into a parallel, main-memory OLTP DBMS to show that we can improve the performance of applications with diverse workloads. △ Less

Submitted 30 October, 2011; originally announced October 2011.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 2, pp. 85-96 (2011)

arXiv:1101.0350 [pdf, ps, other]

Graffiti Networks: A Subversive, Internet-Scale File Sharing Model

Authors: Andrew Pavlo, Ning Shi

Abstract: The proliferation of peer-to-peer (P2P) file sharing protocols is due to their efficient and scalable methods for data dissemination to numerous users. But many of these networks have no provisions to provide users with long term access to files after the initial interest has diminished, nor are they able to guarantee protection for users from malicious clients that wish to implicate them in incri… ▽ More The proliferation of peer-to-peer (P2P) file sharing protocols is due to their efficient and scalable methods for data dissemination to numerous users. But many of these networks have no provisions to provide users with long term access to files after the initial interest has diminished, nor are they able to guarantee protection for users from malicious clients that wish to implicate them in incriminating activities. As such, users may turn to supplementary measures for storing and transferring data in P2P systems. We present a new file sharing paradigm, called a Graffiti Network, which allows peers to harness the potentially unlimited storage of the Internet as a third-party intermediary. Our key contributions in this paper are (1) an overview of a distributed system based on this new threat model and (2) a measurement of its viability through a one-year deployment study using a popular web-publishing platform. The results of this experiment motivate a discussion about the challenges of mitigating this type of file sharing in a hostile network environment and how web site operators can protect their resources. △ Less

Submitted 1 January, 2011; originally announced January 2011.

arXiv:cs/0606007 [pdf, ps, other]

A parent-centered radial layout algorithm for interactive graph visualization and animation

Authors: Andrew Pavlo, Christopher Homan, Jonathan Schull

Abstract: We have developed (1) a graph visualization system that allows users to explore graphs by viewing them as a succession of spanning trees selected interactively, (2) a radial graph layout algorithm, and (3) an animation algorithm that generates meaningful visualizations and smooth transitions between graphs while minimizing edge crossings during transitions and in static layouts. Our system is… ▽ More We have developed (1) a graph visualization system that allows users to explore graphs by viewing them as a succession of spanning trees selected interactively, (2) a radial graph layout algorithm, and (3) an animation algorithm that generates meaningful visualizations and smooth transitions between graphs while minimizing edge crossings during transitions and in static layouts. Our system is similar to the radial layout system of Yee et al. (2001), but differs primarily in that each node is positioned on a coordinate system centered on its own parent rather than on a single coordinate system for all nodes. Our system is thus easy to define recursively and lends itself to parallelization. It also guarantees that layouts have many nice properties, such as: it guarantees certain edges never cross during an animation. We compared the layouts and transitions produced by our algorithms to those produced by Yee et al. Results from several experiments indicate that our system produces fewer edge crossings during transitions between graph drawings, and that the transitions more often involve changes in local scaling rather than structure. These findings suggest the system has promise as an interactive graph exploration tool in a variety of settings. △ Less

Submitted 1 June, 2006; originally announced June 2006.

ACM Class: I.3.3; H.5.0

Showing 1–12 of 12 results for author: Pavlo, A