Search | arXiv e-print repository

H2O-Danube3 Technical Report

Authors: Pascal Pfeiffer, Philipp Singer, Yauhen Babakhin, Gabor Fodor, Nischay Dhankhar, Sri Satish Ambati

Abstract: We present H2O-Danube3, a series of small language models consisting of H2O-Danube3-4B, trained on 6T tokens and H2O-Danube3-500M, trained on 4T tokens. Our models are pre-trained on high quality Web data consisting of primarily English tokens in three stages with different data mixes before final supervised tuning for chat version. The models exhibit highly competitive metrics across a multitude of academic, chat, and fine-tuning benchmarks. Thanks to its compact architecture, H2O-Danube3 can be efficiently run on a modern smartphone, enabling local inference and rapid processing capabilities even on mobile devices. We make all models openly available under Apache 2.0 license further democratizing LLMs to a wider audience economically.

Submitted 12 July, 2024; originally announced July 2024.

arXiv:2401.16818 [pdf, other]

H2O-Danube-1.8B Technical Report

Authors: Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

Abstract: We present H2O-Danube, a series of small 1.8B language models consisting of H2O-Danube-1.8B, trained on 1T tokens, and the incremental improved H2O-Danube2-1.8B trained on an additional 2T tokens. Our models exhibit highly competitive metrics across a multitude of benchmarks and, as of the time of this writing, H2O-Danube2-1.8B achieves the top ranking on Open LLM Leaderboard for all models below the 2B parameter range. The models follow core principles of LLama 2 and Mistral, and we leverage and refine various techniques for pre-training large language models. We additionally release chat models trained with supervised fine-tuning followed by direct preference optimization. We make all models openly available under Apache 2.0 license further democratizing LLMs to a wider audience economically.

Submitted 15 April, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

arXiv:2310.13012 [pdf, other]

H2O Open Ecosystem for State-of-the-art Large Language Models

Authors: Arno Candel, Jon McKinney, Philipp Singer, Pascal Pfeiffer, Maximilian Jeblick, Chun Ming Lee, Marcos V. Conde

Abstract: Large Language Models (LLMs) represent a revolution in AI. However, they also pose many significant risks, such as the presence of biased, private, copyrighted or harmful text. For this reason we need open, transparent and safe solutions. We introduce a complete open-source ecosystem for developing and testing LLMs. The goal of this project is to boost open alternatives to closed-source approaches. We release h2oGPT, a family of fine-tuned LLMs of diverse sizes. We also introduce H2O LLM Studio, a framework and no-code GUI designed for efficient fine-tuning, evaluation, and deployment of LLMs using the most recent state-of-the-art techniques. Our code and models are fully open-source. We believe this work helps to boost AI development and make it more accessible, efficient and trustworthy. The demo is available at: https://gpt.h2o.ai/

Submitted 23 October, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 Demo - ACL Empirical Methods in Natural Language Processing

arXiv:2309.09618 [pdf, other]

A Discussion on Generalization in Next-Activity Prediction

Authors: Luka Abb, Peter Pfeiffer, Peter Fettke, Jana-Rebecca Rehse

Abstract: Next activity prediction aims to forecast the future behavior of running process instances. Recent publications in this field predominantly employ deep learning techniques and evaluate their prediction performance using publicly available event logs. This paper presents empirical evidence that calls into question the effectiveness of these current evaluation approaches. We show that there is an enormous amount of example leakage in all of the commonly used event logs, so that rather trivial prediction approaches perform almost as well as ones that leverage deep learning. We further argue that designing robust evaluations requires a more profound conceptual engagement with the topic of next-activity prediction, and specifically with the notion of generalization to new data. To this end, we present various prediction scenarios that necessitate different types of generalization to guide future research.

Submitted 18 September, 2023; originally announced September 2023.

Comments: Pre-print, published at the AI4BPM workshop at BPM 2023

arXiv:2306.08161 [pdf, other]

h2oGPT: Democratizing Large Language Models

Authors: Arno Candel, Jon McKinney, Philipp Singer, Pascal Pfeiffer, Maximilian Jeblick, Prithvi Prabhu, Jeff Gambera, Mark Landry, Shivam Bansal, Ryan Chesler, Chun Ming Lee, Marcos V. Conde, Pasha Stetsenko, Olivier Grellier, SriSatish Ambati

Abstract: Applications built on top of Large Language Models (LLMs) such as GPT-4 represent a revolution in AI due to their human-level capabilities in natural language processing. However, they also pose many significant risks such as the presence of biased, private, or harmful text, and the unauthorized inclusion of copyrighted material. We introduce h2oGPT, a suite of open-source code repositories for the creation and use of LLMs based on Generative Pretrained Transformers (GPTs). The goal of this project is to create the world's best truly open-source alternative to closed-source approaches. In collaboration with and as part of the incredible and unstoppable open-source community, we open-source several fine-tuned h2oGPT models from 7 to 40 Billion parameters, ready for commercial use under fully permissive Apache 2.0 licenses. Included in our release is 100% private document search using natural language. Open-source language models help boost AI development and make it more accessible and trustworthy. They lower entry hurdles, allowing people and groups to tailor these models to their needs. This openness increases innovation, transparency, and fairness. An open-source strategy is needed to share AI benefits fairly, and H2O.ai will continue to democratize AI and LLMs.

Submitted 16 June, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: Work in progress by H2O.ai, Inc

arXiv:2107.14048 [pdf]

Corridor for new mobility Aachen-Düsseldorf: Methods and concepts of the research project ACCorD

Authors: Laurent Kloeker, Amarin Kloeker, Fabian Thomsen, Armin Erraji, Lutz Eckstein, Serge Lamberty, Adrian Fazekas, Eszter Kalló, Markus Oeser, Charlotte Fléchon, Jochen Lohmiller, Pascal Pfeiffer, Martin Sommer, Helen Winter

Abstract: With the Corridor for New Mobility Aachen - Düsseldorf, an integrated development environment is created, incorporating existing test capabilities, to systematically test and validate automated vehicles in interaction with connected Intelligent Transport Systems Stations (ITS-Ss). This is achieved through a time- and cost-efficient toolchain and methodology, in which simulation, closed test sites as well as test fields in public transport are linked in the best possible way. By implementing a digital twin, the recorded traffic events can be visualized in real-time and driving functions can be tested in the simulation based on real data. In order to represent diverse traffic scenarios, the corridor contains a highway section, a rural area, and urban areas. First, this paper outlines the project goals before describing the individual project contents in more detail. These include the concepts of traffic detection, driving function development, digital twin development, and public involvement.

Submitted 13 July, 2021; originally announced July 2021.

arXiv:2107.07728 [pdf, other]

Recognizing bird species in diverse soundscapes under weak supervision

Authors: Christof Henkel, Pascal Pfeiffer, Philipp Singer

Abstract: We present a robust classification approach for avian vocalization in complex and diverse soundscapes, achieving second place in the BirdCLEF2021 challenge. We illustrate how to make full use of pre-trained convolutional neural networks, by using an efficient modeling and training routine supplemented by novel augmentation methods. Thereby, we improve the generalization of weakly labeled crowd-sourced data to productive data collected by autonomous recording units. As such, we illustrate how to progress towards an accurate automated assessment of avian population which would enable global biodiversity monitoring at scale, impossible by manual annotation.

Submitted 16 July, 2021; originally announced July 2021.

Comments: All authors contributed equally, 8 pages, 4 figures, submitted to CEUR-WS

arXiv:2106.08027 [pdf, other]

Multivariate Business Process Representation Learning utilizing Gramian Angular Fields and Convolutional Neural Networks

Authors: Peter Pfeiffer, Johannes Lahann, Peter Fettke

Abstract: Learning meaningful representations of data is an important aspect of machine learning and has recently been successfully applied to many domains like language understanding or computer vision. Instead of training a model for one specific task, representation learning is about training a model to capture all useful information in the underlying data and make it accessible for a predictor. For predictive process analytics, it is essential to have all explanatory characteristics of a process instance available when making predictions about the future, as well as for clustering and anomaly detection. Due to the large variety of perspectives and types within business process data, generating a good representation is a challenging task. In this paper, we propose a novel approach for representation learning of business process instances which can process and combine most perspectives in an event log. In conjunction with a self-supervised pre-training method, we show the capabilities of the approach through a visualization of the representation space and case retrieval. Furthermore, the pre-trained model is fine-tuned to multiple process prediction tasks and demonstrates its effectiveness in comparison with existing approaches.

Submitted 15 June, 2021; originally announced June 2021.

Comments: Accepted at the Business Process Management Conference 2021

arXiv:1804.10120 [pdf, other]

Automatic generation of CUDA code performing tensor manipulations using C++ expression templates

Authors: Adam G. M. Lewis, Harald P. Pfeiffer

Abstract: We present a C++ library, TLoops, which uses a hierarchy of expression templates to represent operations upon tensorial quantities in single lines of C++ code that resemble analytic equations. These expressions may be run as-is, but may also be used to emit equivalent low-level C or CUDA code, which either performs the operations more quickly on the CPU, or allows them to be rapidly ported to run on NVIDIA GPUs. We detail the expression template and C++-class hierarchy that represents the expressions and which makes automatic code-generation possible. We then present benchmarks of the expression-template code, the automatically generated C code, and the automatically generated CUDA code running on several generations of NVIDIA GPU.

Submitted 24 April, 2018; originally announced April 2018.

Comments: 46 pages, 5 figures

arXiv:1711.06276 [pdf, other]

doi 10.1103/PhysRevD.97.024031

Eccentric, nonspinning, inspiral, Gaussian-process merger approximant for the detection and characterization of eccentric binary black hole mergers

Authors: E. A. Huerta, C. J. Moore, Prayush Kumar, Daniel George, Alvin J. K. Chua, Roland Haas, Erik Wessel, Daniel Johnson, Derek Glennon, Adam Rebei, A. Miguel Holgado, Jonathan R. Gair, Harald P. Pfeiffer

Abstract: We present $\texttt{ENIGMA}$, a time domain, inspiral-merger-ringdown waveform model that describes non-spinning binary black holes systems that evolve on moderately eccentric orbits. The inspiral evolution is described using a consistent combination of post-Newtonian theory, self-force and black hole perturbation theory. Assuming eccentric binaries that circularize prior to coalescence, we smoothly match the eccentric inspiral with a stand-alone, quasi-circular merger, which is constructed using machine learning algorithms that are trained with quasi-circular numerical relativity waveforms. We show that $\texttt{ENIGMA}$ reproduces with excellent accuracy the dynamics of quasi-circular compact binaries. We validate $\texttt{ENIGMA}$ using a set of $\texttt{Einstein Toolkit}$ eccentric numerical relativity waveforms, which describe eccentric binary black hole mergers with mass-ratios between $1 \leq q \leq 5.5$, and eccentricities $e_0 \lesssim 0.2$ ten orbits before merger. We use this model to explore in detail the physics that can be extracted with moderately eccentric, non-spinning binary black hole mergers. We use $\texttt{ENIGMA}$ to show that GW150914, GW151226, GW170104, GW170814 and GW170608 can be effectively recovered with spinning, quasi-circular templates if the eccentricity of these events at a gravitational wave frequency of 10Hz satisfies $e_0\leq \{0.175,\, 0.125,\,0.175,\,0.175,\, 0.125\}$, respectively. We show that if these systems have eccentricities $e_0\sim 0.1$ at a gravitational wave frequency of 10Hz, they can be misclassified as quasi-circular binaries due to parameter space degeneracies between eccentricity and spin corrections. Using our catalog of eccentric numerical relativity simulations, we discuss the importance of including higher-order waveform multipoles in gravitational wave searches of eccentric binary black hole mergers.

Submitted 24 January, 2018; v1 submitted 16 November, 2017; originally announced November 2017.

Comments: 19 pages, 10 figures, 1 Appendix. v2: we use numerical relativity simulations to quantify the importance of including higher-order waveform multipoles for the detection of eccentric binary black hole mergers, references added. Accepted to Phys. Rev. D

ACM Class: J.2

Journal ref: Phys. Rev. D 97, 024031 (2018)

Showing 1–10 of 10 results for author: Pfeiffer, P