
QuCLEAR: Clifford Extraction and Absorption for Significant Reduction in Quantum Circuit Size

Authors: Ji Liu, Alvin Gonzales, Benchen Huang, Zain Hamid Saleem, Paul Hovland

Abstract: Quantum computing carries significant potential for addressing practical problems. However, currently available quantum devices suffer from noisy quantum gates, which degrade the fidelity of executed quantum circuits. Therefore, quantum circuit optimization is crucial for obtaining useful results. In this paper, we present QuCLEAR, a compilation framework designed to optimize quantum circuits. QuCLEAR significantly reduces both the two-qubit gate count and the circuit depth through two novel optimization steps. First, we introduce the concept of Clifford Extraction, which extracts Clifford subcircuits to the end of the circuit while optimizing the gates. Second, since Clifford circuits are classically simulatable, we propose Clifford Absorption, which efficiently processes the extracted Clifford subcircuits classically. We demonstrate our framework on quantum simulation circuits, which have wide-ranging applications in quantum chemistry simulation, many-body physics, and combinatorial optimization problems. Near-term algorithms such as VQE and QAOA also fall within this category. Experimental results across various benchmarks show that QuCLEAR achieves up to a $77.7\%$ reduction in CNOT gate count and up to an $84.1\%$ reduction in entangling depth compared to state-of-the-art methods.

Submitted 23 August, 2024; originally announced August 2024.

Comments: 13 pages, 9 figures, 2 tables

arXiv:2404.17039 [pdf, other]

Differentiating Through Linear Solvers

Authors: Paul Hovland, Jan Hückelheim

Abstract: Computer programs containing calls to linear solvers are a known challenge for automatic differentiation. Previous publications advise against differentiating through the low-level solver implementation, and instead advocate for high-level approaches that express the derivative in terms of a modified linear system that can be solved with a separate solver call. Despite this ubiquitous advice, we are not aware of prior work comparing the accuracy of both approaches. With this article we thus empirically study a simple question: What happens if we ignore common wisdom, and differentiate through linear solvers?

Submitted 6 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

arXiv:2402.09222 [pdf, other]

Integrating ytopt and libEnsemble to Autotune OpenMC

Authors: Xingfu Wu, John R. Tramm, Jeffrey Larson, John-Luke Navarro, Prasanna Balaprakash, Brice Videau, Michael Kruse, Paul Hovland, Valerie Taylor, Mary Hall

Abstract: ytopt is a Python machine-learning-based autotuning software package developed within the ECP PROTEAS-TUNE project. The ytopt software adopts an asynchronous search framework that consists of sampling a small number of input parameter configurations and progressively fitting a surrogate model over the input-output space until exhausting the user-defined maximum number of evaluations or the wall-clock time. libEnsemble is a Python toolkit for coordinating workflows of asynchronous and dynamic ensembles of calculations across massively parallel resources developed within the ECP PETSc/TAO project. libEnsemble helps users take advantage of massively parallel resources to solve design, decision, and inference problems and expands the class of problems that can benefit from increased parallelism. In this paper we present our methodology and framework to integrate ytopt and libEnsemble to take advantage of massively parallel resources to accelerate the autotuning process. Specifically, we focus on using the proposed framework to autotune the ECP ExaSMR application OpenMC, an open source Monte Carlo particle transport code. OpenMC has seven tunable parameters, some of which have large ranges, such as the number of particles in-flight, which is in the range of 100,000 to 8 million, with its default setting of 1 million. Setting the proper combination of these parameter values to achieve the best performance is extremely time-consuming. Therefore, we apply the proposed framework to autotune the MPI/OpenMP offload version of OpenMC based on a user-defined metric such as the figure of merit (FoM) (particles/s) or energy efficiency measured by the energy-delay product (EDP) on the OLCF Frontier TDS system Crusher. The experimental results show that we achieve improvements of up to 29.49% in FoM and up to 30.44% in EDP.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2401.04669 [pdf, other]

doi 10.1145/3577193.3593712

Transfer-Learning-Based Autotuning Using Gaussian Copula

Authors: Thomas Randall, Jaehoon Koo, Brice Videau, Michael Kruse, Xingfu Wu, Paul Hovland, Mary Hall, Rong Ge, Prasanna Balaprakash

Abstract: As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39$\times$ speedup, a dramatic improvement over the 20.58$\times$ speedup using prior techniques.

Submitted 9 January, 2024; originally announced January 2024.

Comments: 13 pages, 5 figures, 7 tables, the definitive version of this work is published in the Proceedings of the ACM International Conference on Supercomputing 2023, available at https://dl.acm.org/doi/10.1145/3577193.3593712

ACM Class: I.2.4; G.3; D.2.8

Journal ref: Proceedings of the 37th International Conference on Supercomputing (2023) 37-49

arXiv:2305.18198 [pdf, ps, other]

Model Checking Race-freedom When "Sequential Consistency for Data-race-free Programs" is Guaranteed

Authors: Wenhao Wu, Jan Hückelheim, Paul D. Hovland, Ziqing Luo, Stephen F. Siegel

Abstract: Many parallel programming models guarantee that if all sequentially consistent (SC) executions of a program are free of data races, then all executions of the program will appear to be sequentially consistent. This greatly simplifies reasoning about the program, but leaves open the question of how to verify that all SC executions are race-free. In this paper, we show that with a few simple modifications, model checking can be an effective tool for verifying race-freedom. We explore this technique on a suite of C programs parallelized with OpenMP.

Submitted 20 July, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:2305.07546 [pdf, other]

Understanding Automatic Differentiation Pitfalls

Authors: Jan Hückelheim, Harshitha Menon, William Moses, Bruce Christianson, Paul Hovland, Laurent Hascoët

Abstract: Automatic differentiation, also known as backpropagation, AD, autodiff, or algorithmic differentiation, is a popular technique for computing derivatives of computer programs accurately and efficiently. Sometimes, however, the derivatives computed by AD could be interpreted as incorrect. These pitfalls occur systematically across tools and approaches. In this paper we broadly categorize problematic usages of AD and illustrate each category with examples such as chaos, time-averaged oscillations, discretizations, fixed-point loops, lookup tables, and linear solvers. We also review debugging techniques and their effectiveness in these situations. With this article we hope to help readers avoid unexpected behavior, detect problems more easily when they occur, and have more realistic expectations from AD tools.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2305.02939 [pdf, other]

Tackling the Qubit Mapping Problem with Permutation-Aware Synthesis

Authors: Ji Liu, Ed Younis, Mathias Weiden, Paul Hovland, John Kubiatowicz, Costin Iancu

Abstract: We propose a novel hierarchical qubit mapping and routing algorithm. First, a circuit is decomposed into blocks that span an identical number of qubits. In the second stage, permutation-aware synthesis (PAS), each block is optimized and synthesized in isolation. In the third stage, a permutation-aware mapping (PAM) algorithm maps the blocks to the target device based on the information from the second stage. Our approach is based on the following insights: (1) partitioning the circuit into blocks is beneficial for qubit mapping and routing; (2) with PAS, any block can implement an arbitrary input-output qubit mapping that reduces the gate count; and (3) with PAM, for two adjacent blocks we can select input-output permutations that optimize each block together with the amount of communication required at the block boundary. Whereas existing mapping algorithms preserve the original circuit structure and only introduce "minimal" communication via inserting SWAP or bridge gates, the PAS+PAM approach can additionally change the circuit structure and take full advantage of hardware connectivity. Our experiments show that we can produce better-quality circuits than existing mapping algorithms or commercial compilers (Qiskit, TKET, BQSKit) with maximum optimization settings. For a combination of benchmarks we produce circuits with up to 68% (18% on average) fewer gates than Qiskit, up to 36% (9% on average) fewer gates than TKET, and up to 67% (21% on average) fewer gates than BQSKit. Furthermore, the approach scales, and it can be seamlessly integrated into any quantum circuit compiler or optimization infrastructure.

Submitted 4 May, 2023; originally announced May 2023.

Comments: 12 pages, 9 figures, 5 tables

arXiv:2303.16245 [pdf, other]

ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales

Authors: Xingfu Wu, Prasanna Balaprakash, Michael Kruse, Jaehoon Koo, Brice Videau, Paul Hovland, Valerie Taylor, Brad Geltz, Siddhartha Jana, Mary Hall

Abstract: As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime and power/energy for energy efficient application execution, then use this framework to autotune four ECP proxy applications -- XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with a Random Forest surrogate model to effectively search parameter spaces with up to 6 million different configurations on two large-scale production systems, Theta at Argonne National Laboratory and Summit at Oak Ridge National Laboratory. The experimental results show that our autotuning framework at large scales has low overhead and achieves good scalability. Using the proposed autotuning framework to identify the best configurations, we achieve up to 91.59% performance improvement, up to 21.2% energy savings, and up to 37.84% EDP improvement on up to 4,096 nodes.

Submitted 28 March, 2023; originally announced March 2023.

Journal ref: to be published in CUG2023

arXiv:2302.02003 [pdf, other]

doi 10.1109/ISCAS46773.2023.10181370

QContext: Context-Aware Decomposition for Quantum Gates

Authors: Ji Liu, Max Bowman, Pranav Gokhale, Siddharth Dangwal, Jeffrey Larson, Frederic T. Chong, Paul D. Hovland

Abstract: In this paper we propose QContext, a new compiler structure that incorporates context-aware and topology-aware decompositions. Because of circuit equivalence rules and resynthesis, variants of a gate-decomposition template may exist. QContext exploits the circuit information and the hardware topology to select the gate variant that increases circuit optimization opportunities. We study the basis-gate-level context-aware decomposition for Toffoli gates and the native-gate-level context-aware decomposition for CNOT gates. Our experiments show that QContext reduces the number of gates as compared with the state-of-the-art approach, Orchestrated Trios.

Submitted 3 February, 2023; originally announced February 2023.

Comments: 10 pages

arXiv:2105.04555 [pdf]

Customized Monte Carlo Tree Search for LLVM/Polly's Composable Loop Optimization Transformations

Authors: Jaehoon Koo, Prasanna Balaprakash, Michael Kruse, Xingfu Wu, Paul Hovland, Mary Hall

Abstract: Polly is the LLVM project's polyhedral loop nest optimizer. Recently, user-directed loop transformation pragmas were proposed based on LLVM/Clang and Polly. The search space exposed by the transformation pragmas is a tree, wherein each node represents a specific combination of loop transformations that can be applied to the code resulting from the parent node's loop transformations. We have developed a search algorithm based on Monte Carlo tree search (MCTS) to find the best combination of loop transformations. Our algorithm consists of two phases: exploring loop transformations at different depths of the tree to identify promising regions in the tree search space and exploiting those regions by performing a local search. Moreover, a restart mechanism is used to avoid the MCTS getting trapped in a local solution. The best and worst solutions are transferred from the previous phases of the restarts to leverage the search history. We compare our approach with random, greedy, and breadth-first search methods on PolyBench kernels and ECP proxy applications. Experimental results show that our MCTS algorithm finds pragma combinations with a speedup of 2.3x over Polly's heuristic optimizations on average.

Submitted 10 May, 2021; originally announced May 2021.

arXiv:2104.13242 [pdf, other]

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization (extended version)

Authors: Xingfu Wu, Michael Kruse, Prasanna Balaprakash, Hal Finkel, Paul Hovland, Valerie Taylor, Mary Hall

Abstract: In this paper, we develop a ytopt autotuning framework that leverages Bayesian optimization to explore the parameter space search and compare four different supervised learning methods within Bayesian optimization and evaluate their effectiveness. We select six of the most complex PolyBench benchmarks and apply the newly developed LLVM Clang/Polly loop optimization pragmas to the benchmarks to optimize them. We then use the autotuning framework to optimize the pragma parameters to improve their performance. The experimental results show that our autotuning approach outperforms the other compiling methods to provide the smallest execution time for the benchmarks syr2k, 3mm, heat-3d, lu, and covariance with two large datasets in 200 code evaluations for effectively searching the parameter spaces with up to 170,368 different configurations. We find that the Floyd-Warshall benchmark did not benefit from autotuning because Polly uses heuristics to optimize the benchmark to make it run much slower. To cope with this issue, we provide some compiler option solutions to improve the performance. Then we present loop autotuning without a user's knowledge using a simple mctree autotuning framework to further improve the performance of the Floyd-Warshall benchmark. We also extend the ytopt autotuning framework to tune a deep learning application.

Submitted 27 April, 2021; originally announced April 2021.

Comments: Submitted to CCPE journal. arXiv admin note: substantial text overlap with arXiv:2010.08040

arXiv:2010.08040 [pdf, other]

Autotuning PolyBench Benchmarks with LLVM Clang/Polly Loop Optimization Pragmas Using Bayesian Optimization

Authors: Xingfu Wu, Michael Kruse, Prasanna Balaprakash, Hal Finkel, Paul Hovland, Valerie Taylor, Mary Hall

Abstract: Autotuning is an approach that explores a search space of possible implementations/configurations of a kernel or an application by selecting and evaluating a subset of implementations/configurations on a target platform and/or using models to identify a high-performance implementation/configuration. In this paper, we develop an autotuning framework that leverages Bayesian optimization to explore the parameter space search. We select six of the most complex benchmarks from the application domains of the PolyBench benchmarks (syr2k, 3mm, heat-3d, lu, covariance, and Floyd-Warshall) and apply the newly developed LLVM Clang/Polly loop optimization pragmas to the benchmarks to optimize them. We then use the autotuning framework to optimize the pragma parameters to improve their performance. The experimental results show that our autotuning approach outperforms the other compiling methods to provide the smallest execution time for the benchmarks syr2k, 3mm, heat-3d, lu, and covariance with two large datasets in 200 code evaluations for effectively searching the parameter spaces with up to 170,368 different configurations. We compare four different supervised learning methods within Bayesian optimization and evaluate their effectiveness. We find that the Floyd-Warshall benchmark did not benefit from autotuning because Polly uses heuristics to optimize the benchmark to make it run much slower. To cope with this issue, we provide some compiler option solutions to improve the performance.

Submitted 15 October, 2020; originally announced October 2020.

Comments: to be published in the 11th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS20)

arXiv:1909.02836 [pdf, other]

Computing Derivatives for PETSc Adjoint Solvers using Algorithmic Differentiation

Authors: J. G. Wallwork, P. Hovland, H. Zhang, O. Marin

Abstract: Most nonlinear partial differential equation (PDE) solvers require the Jacobian matrix associated to the differential operator. In PETSc, this is typically achieved by either an analytic derivation or numerical approximation method such as finite differences. For complex applications, hand-coding the Jacobian can be time-consuming and error-prone, yet computationally efficient. Whilst finite difference approximations are straightforward to implement, they have high arithmetic complexity and low accuracy. Alternatively, one may compute Jacobians using algorithmic differentiation (AD), yielding the same derivatives as an analytic derivation, with the added benefit that the implementation is problem independent. In this work, the operator overloading AD tool ADOL-C is applied to generate Jacobians for time-dependent, nonlinear PDEs and their adjoints. Various strategies are considered, including compressed and matrix-free approaches. In numerical experiments with a 2D diffusion-reaction model, the performance of these strategies has been studied and compared to the hand-derived version.

Submitted 6 September, 2019; originally announced September 2019.

Comments: 14 pages, 3 figures, 2 listings, 1 table

MSC Class: 68U01

arXiv:1907.02818 [pdf, other]

doi 10.1145/3337821.3337906

Automatic Differentiation for Adjoint Stencil Loops

Authors: Jan Hückelheim, Navjot Kukreja, Sri Hari Krishna Narayanan, Fabio Luporini, Gerard Gorman, Paul Hovland

Abstract: Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.

Submitted 5 July, 2019; originally announced July 2019.

Comments: ICPP 2019

arXiv:1903.03051 [pdf, other]

Training on the Edge: The why and the how

Authors: Navjot Kukreja, Alena Shilova, Olivier Beaumont, Jan Huckelheim, Nicola Ferrier, Paul Hovland, Gerard Gorman

Abstract: Edge computing is the natural progression from Cloud computing, where, instead of collecting all data and processing it centrally, like in a cloud computing environment, we distribute the computing power and try to do as much processing as possible, close to the source of the data. There are various reasons this model is being adopted quickly, including privacy, and reduced power and bandwidth requirements on the Edge nodes. While it is common to see inference being done on Edge nodes today, it is much less common to do training on the Edge. The reasons for this range from computational limitations, to it not being advantageous in reducing communications between the Edge nodes. In this paper, we explore some scenarios where it is advantageous to do training on the Edge, as well as the use of checkpointing strategies to save memory.

Submitted 13 February, 2019; originally announced March 2019.

Comments: Submitted to PAISE 2019

arXiv:1810.05268 [pdf, other]

Combining Checkpointing and Data Compression to Accelerate Adjoint-Based Optimization Problems

Authors: Navjot Kukreja, Jan Hueckelheim, Mathias Louboutin, Fabio Luporini, Paul Hovland, Gerard Gorman

Abstract: Seismic inversion and imaging are adjoint-based optimization problems that process up to terabytes of data, regularly exceeding the memory capacity of available computers. Data compression is an effective strategy to reduce this memory requirement by a certain factor, particularly if some loss in accuracy is acceptable. A popular alternative is checkpointing, where data is stored at selected points in time, and values at other times are recomputed as needed from the last stored state. This allows arbitrarily large adjoint computations with limited memory, at the cost of additional recomputations. In this paper, we combine compression and checkpointing for the first time to compute a realistic seismic inversion. The combination of checkpointing and compression allows larger adjoint computations compared to using only compression, and reduces the recomputation overhead significantly compared to using only checkpointing.

Submitted 20 September, 2021; v1 submitted 11 October, 2018; originally announced October 2018.

Comments: Accepted in European Conference on Parallel Processing (EuroPar) 2019. Part of the Lecture Notes in Computer Science book series (LNCS, volume 11725)

arXiv:1705.07478 [pdf, other]

Report of the HPC Correctness Summit, Jan 25--26, 2017, Washington, DC

Authors: Ganesh Gopalakrishnan, Paul D. Hovland, Costin Iancu, Sriram Krishnamoorthy, Ignacio Laguna, Richard A. Lethin, Koushik Sen, Stephen F. Siegel, Armando Solar-Lezama

Abstract: Maintaining leadership in HPC requires the ability to support simulations at large scales and fidelity. In this study, we detail one of the most significant productivity challenges in achieving this goal, namely the increasing proclivity to bugs, especially in the face of growing hardware and software heterogeneity and sheer system scale. We identify key areas where timely new research must be proactively begun to address these challenges, and create new correctness tools that must ideally play a significant role even while ramping up toward exascale. We close with the proposal for a two-day workshop in which the problems identified in this report can be more broadly discussed, and specific plans to launch these new research thrusts identified.

Submitted 21 May, 2017; originally announced May 2017.

Comments: 57 pages

arXiv:1309.1780 [pdf, ps, other]

doi 10.5334/jors.aw

Software Abstractions and Methodologies for HPC Simulation Codes on Future Architectures

Authors: A. Dubey, S. Brandt, R. Brower, M. Giles, P. Hovland, D. Q. Lamb, F. Loffler, B. Norris, B. OShea, C. Rebbi, M. Snir, R. Thakur

Abstract: Large, complex, multi-scale, multi-physics simulation codes, running on high performance computing (HPC) platforms, have become essential to advancing science and engineering. These codes simulate multi-scale, multi-physics phenomena with unprecedented fidelity on petascale platforms, and are used by large communities. Continued ability of these codes to run on future platforms is as crucial to their communities as continued improvements in instruments and facilities are to experimental scientists. However, the ability of code developers to do these things faces a serious challenge with the paradigm shift underway in platform architecture. The complexity and uncertainty of the future platforms makes it essential to approach this challenge cooperatively as a community. We need to develop common abstractions, frameworks, programming models and software development methodologies that can be applied across a broad range of complex simulation codes, and common software infrastructure to support them. In this position paper we express and discuss our belief that such an infrastructure is critical to the deployment of existing and new large, multi-scale, multi-physics codes on future HPC platforms.

Submitted 6 September, 2013; originally announced September 2013.

Comments: Position Paper

Showing 1–18 of 18 results for author: Hovland, P