-
H2O-Danube3 Technical Report
Authors:
Pascal Pfeiffer,
Philipp Singer,
Yauhen Babakhin,
Gabor Fodor,
Nischay Dhankhar,
Sri Satish Ambati
Abstract:
We present H2O-Danube3, a series of small language models consisting of H2O-Danube3-4B, trained on 6T tokens, and H2O-Danube3-500M, trained on 4T tokens. Our models are pre-trained on high-quality web data consisting primarily of English tokens, in three stages with different data mixes, before final supervised tuning for the chat version. The models exhibit highly competitive metrics across a multitude of academic, chat, and fine-tuning benchmarks. Thanks to its compact architecture, H2O-Danube3 can be run efficiently on a modern smartphone, enabling local inference and rapid processing even on mobile devices. We make all models openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.
Submitted 12 July, 2024;
originally announced July 2024.
-
H2O-Danube-1.8B Technical Report
Authors:
Philipp Singer,
Pascal Pfeiffer,
Yauhen Babakhin,
Maximilian Jeblick,
Nischay Dhankhar,
Gabor Fodor,
Sri Satish Ambati
Abstract:
We present H2O-Danube, a series of small 1.8B language models consisting of H2O-Danube-1.8B, trained on 1T tokens, and the incrementally improved H2O-Danube2-1.8B, trained on an additional 2T tokens. Our models exhibit highly competitive metrics across a multitude of benchmarks and, as of the time of this writing, H2O-Danube2-1.8B achieves the top ranking on the Open LLM Leaderboard among all models below the 2B parameter range. The models follow the core principles of Llama 2 and Mistral, and we leverage and refine various techniques for pre-training large language models. We additionally release chat models trained with supervised fine-tuning followed by direct preference optimization. We make all models openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.
Submitted 15 April, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Semi-Supervised Segmentation of Salt Bodies in Seismic Images using an Ensemble of Convolutional Neural Networks
Authors:
Yauhen Babakhin,
Artsiom Sanakoyeu,
Hirotoshi Kitamura
Abstract:
Seismic image analysis plays a crucial role in a wide range of industrial applications and has been receiving significant attention. One of the essential challenges of seismic imaging is detecting subsurface salt structures, which is indispensable for identifying hydrocarbon reservoirs and planning drill paths. Unfortunately, exact identification of large salt deposits is notoriously difficult, and professional seismic imaging often requires expert human interpretation of salt bodies. Convolutional neural networks (CNNs) have been successfully applied in many fields, and several attempts have been made in the field of seismic imaging. However, the high cost of manual annotation by geophysics experts and the scarcity of publicly available labeled datasets hinder the performance of existing CNN-based methods. In this work, we propose a semi-supervised method for segmentation (delineation) of salt bodies in seismic images that utilizes unlabeled data for multi-round self-training. To reduce error amplification during self-training, we propose a scheme that uses an ensemble of CNNs. We show that our approach outperforms the state of the art on the TGS Salt Identification Challenge dataset and is ranked first among the 3234 competing methods.
Submitted 5 August, 2019; v1 submitted 8 April, 2019;
originally announced April 2019.