-
BaCO: A Fast and Portable Bayesian Compiler Optimization Framework
Authors:
Erik Hellsten,
Artur Souza,
Johannes Lenfers,
Rubens Lacouture,
Olivia Hsu,
Adel Ejjeh,
Fredrik Kjolstad,
Michel Steuwer,
Kunle Olukotun,
Luigi Nardi
Abstract:
We introduce the Bayesian Compiler Optimization framework (BaCO), a general-purpose autotuner for modern compilers targeting CPUs, GPUs, and FPGAs. BaCO provides the flexibility needed to handle the requirements of modern autotuning tasks. In particular, it deals with permutation, ordered, and continuous parameter types along with both known and unknown parameter constraints. To reason about these parameter types and efficiently deliver high-quality code, BaCO uses Bayesian optimization algorithms specialized towards the autotuning domain. We demonstrate BaCO's effectiveness on three modern compiler systems: TACO, RISE & ELEVATE, and HPVM2FPGA for CPUs, GPUs, and FPGAs, respectively. For these domains, BaCO outperforms current state-of-the-art autotuners by delivering on average 1.36x-1.56x faster code with a tiny search budget, and BaCO is able to reach expert-level performance 2.9x-3.9x faster.
Submitted 11 April, 2023; v1 submitted 1 December, 2022;
originally announced December 2022.
-
Trireme: Exploring Hierarchical Multi-Level Parallelism for Domain Specific Hardware Acceleration
Authors:
Georgios Zacharopoulos,
Adel Ejjeh,
Ying Jing,
En-Yu Yang,
Tianyu Jia,
Iulian Brumar,
Jeremy Intan,
Muhammad Huzaifa,
Sarita Adve,
Vikram Adve,
Gu-Yeon Wei,
David Brooks
Abstract:
The design of heterogeneous systems that include domain-specific accelerators is a challenging and time-consuming process. While taking into account area constraints, designers must decide which parts of an application to accelerate in hardware and which to leave in software. Moreover, applications in domains such as Extended Reality (XR) offer opportunities for various forms of parallel execution, including loop-level, task-level, and pipeline parallelism. To assist the design process and expose every possible level of parallelism, we present Trireme, a fully automated tool-chain that explores multiple levels of parallelism and produces domain-specific accelerator designs and configurations that maximize performance, given an area budget. Experiments on demanding benchmarks from the XR domain revealed a speedup of up to 20x, as well as a speedup of up to 37x for smaller applications, compared to software-only implementations.
Submitted 21 January, 2022;
originally announced January 2022.
-
Studying the Potential of Automatic Optimizations in the Intel FPGA SDK for OpenCL
Authors:
Adel Ejjeh,
Vikram Adve,
Rob Rutenbar
Abstract:
High Level Synthesis (HLS) tools, like the Intel FPGA SDK for OpenCL, improve design productivity and enable efficient design space exploration guided by simple program directives (pragmas), but may sometimes miss important optimizations necessary for high performance. In this paper, we present a study of the tradeoffs in HLS optimizations, and the potential of a modern HLS tool in automatically optimizing an application. We perform the study on a 5-stage camera ISP pipeline using the Intel FPGA SDK for OpenCL and an Arria 10 FPGA Dev Kit. We show that automatic optimizations in the HLS tool are valuable, achieving up to a 2.7X speedup over equivalent CPU execution. With further hand tuning, however, we can achieve up to a 36.5X speedup over CPU. We draw several specific lessons about the effectiveness of automatic optimizations guided by simple directives, and the nature of manual rewriting required for high performance.
Submitted 10 January, 2022;
originally announced January 2022.