-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1110 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…
▽ More
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
△ Less
Submitted 8 August, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Mixtral of Experts
Authors:
Albert Q. Jiang,
Alexandre Sablayrolles,
Antoine Roux,
Arthur Mensch,
Blanche Savary,
Chris Bamford,
Devendra Singh Chaplot,
Diego de las Casas,
Emma Bou Hanna,
Florian Bressand,
Gianna Lengyel,
Guillaume Bour,
Guillaume Lample,
Lélio Renard Lavaud,
Lucile Saulnier,
Marie-Anne Lachaux,
Pierre Stock,
Sandeep Subramanian,
Sophia Yang,
Szymon Antoniak,
Teven Le Scao,
Théophile Gervet,
Thibaut Lavril,
Thomas Wang,
Timothée Lacroix
, et al. (1 additional authors not shown)
Abstract:
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected e…
▽ More
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
△ Less
Submitted 8 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…
▽ More
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
△ Less
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Mistral 7B
Authors:
Albert Q. Jiang,
Alexandre Sablayrolles,
Arthur Mensch,
Chris Bamford,
Devendra Singh Chaplot,
Diego de las Casas,
Florian Bressand,
Gianna Lengyel,
Guillaume Lample,
Lucile Saulnier,
Lélio Renard Lavaud,
Marie-Anne Lachaux,
Pierre Stock,
Teven Le Scao,
Thibaut Lavril,
Thomas Wang,
Timothée Lacroix,
William El Sayed
Abstract:
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences o…
▽ More
We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. We also provide a model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses the Llama 2 13B -- Chat model both on human and automated benchmarks. Our models are released under the Apache 2.0 license.
△ Less
Submitted 10 October, 2023;
originally announced October 2023.
-
NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research
Authors:
Jorg Bornschein,
Alexandre Galashov,
Ross Hemsley,
Amal Rannen-Triki,
Yutian Chen,
Arslan Chaudhry,
Xu Owen He,
Arthur Douillard,
Massimo Caccia,
Qixuang Feng,
Jiajun Shen,
Sylvestre-Alvise Rebuffi,
Kitty Stacpoole,
Diego de las Casas,
Will Hawkins,
Angeliki Lazaridou,
Yee Whye Teh,
Andrei A. Rusu,
Razvan Pascanu,
Marc'Aurelio Ranzato
Abstract:
A shared goal of several machine learning communities like continual learning, meta-learning and transfer learning, is to design algorithms and models that efficiently and robustly adapt to unseen tasks. An even more ambitious goal is to build models that never stop adapting, and that become increasingly more efficient through time by suitably transferring the accrued knowledge. Beyond the study o…
▽ More
A shared goal of several machine learning communities like continual learning, meta-learning and transfer learning, is to design algorithms and models that efficiently and robustly adapt to unseen tasks. An even more ambitious goal is to build models that never stop adapting, and that become increasingly more efficient through time by suitably transferring the accrued knowledge. Beyond the study of the actual learning algorithm and model architecture, there are several hurdles towards our quest to build such models, such as the choice of learning protocol, metric of success and data needed to validate research hypotheses. In this work, we introduce the Never-Ending VIsual-classification Stream (NEVIS'22), a benchmark consisting of a stream of over 100 visual classification tasks, sorted chronologically and extracted from papers sampled uniformly from computer vision proceedings spanning the last three decades. The resulting stream reflects what the research community thought was meaningful at any point in time, and it serves as an ideal test bed to assess how well models can adapt to new tasks, and do so better and more efficiently as time goes by. Despite being limited to classification, the resulting stream has a rich diversity of tasks from OCR, to texture analysis, scene recognition, and so forth. The diversity is also reflected in the wide range of dataset sizes, spanning over four orders of magnitude. Overall, NEVIS'22 poses an unprecedented challenge for current sequential learning approaches due to the scale and diversity of tasks, yet with a low entry barrier as it is limited to a single modality and well understood supervised learning problems. Moreover, we provide a reference implementation including strong baselines and an evaluation protocol to compare methods in terms of their trade-off between accuracy and compute.
△ Less
Submitted 16 May, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Training Compute-Optimal Large Language Models
Authors:
Jordan Hoffmann,
Sebastian Borgeaud,
Arthur Mensch,
Elena Buchatskaya,
Trevor Cai,
Eliza Rutherford,
Diego de Las Casas,
Lisa Anne Hendricks,
Johannes Welbl,
Aidan Clark,
Tom Hennigan,
Eric Noland,
Katie Millican,
George van den Driessche,
Bogdan Damoc,
Aurelia Guy,
Simon Osindero,
Karen Simonyan,
Erich Elsen,
Jack W. Rae,
Oriol Vinyals,
Laurent Sifre
Abstract:
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion…
▽ More
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
△ Less
Submitted 29 March, 2022;
originally announced March 2022.
-
Unified Scaling Laws for Routed Language Models
Authors:
Aidan Clark,
Diego de las Casas,
Aurelia Guy,
Arthur Mensch,
Michela Paganini,
Jordan Hoffmann,
Bogdan Damoc,
Blake Hechtman,
Trevor Cai,
Sebastian Borgeaud,
George van den Driessche,
Eliza Rutherford,
Tom Hennigan,
Matthew Johnson,
Katie Millican,
Albin Cassirer,
Chris Jones,
Elena Buchatskaya,
David Budden,
Laurent Sifre,
Simon Osindero,
Oriol Vinyals,
Jack Rae,
Erich Elsen,
Koray Kavukcuoglu
, et al. (1 additional authors not shown)
Abstract:
The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better…
▽ More
The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
△ Less
Submitted 9 February, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
-
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Authors:
Jack W. Rae,
Sebastian Borgeaud,
Trevor Cai,
Katie Millican,
Jordan Hoffmann,
Francis Song,
John Aslanides,
Sarah Henderson,
Roman Ring,
Susannah Young,
Eliza Rutherford,
Tom Hennigan,
Jacob Menick,
Albin Cassirer,
Richard Powell,
George van den Driessche,
Lisa Anne Hendricks,
Maribeth Rauh,
Po-Sen Huang,
Amelia Glaese,
Johannes Welbl,
Sumanth Dathathri,
Saffron Huang,
Jonathan Uesato,
John Mellor
, et al. (55 additional authors not shown)
Abstract:
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gop…
▽ More
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.
△ Less
Submitted 21 January, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Improving language models by retrieving from trillions of tokens
Authors:
Sebastian Borgeaud,
Arthur Mensch,
Jordan Hoffmann,
Trevor Cai,
Eliza Rutherford,
Katie Millican,
George van den Driessche,
Jean-Baptiste Lespiau,
Bogdan Damoc,
Aidan Clark,
Diego de Las Casas,
Aurelia Guy,
Jacob Menick,
Roman Ring,
Tom Hennigan,
Saffron Huang,
Loren Maggiore,
Chris Jones,
Albin Cassirer,
Andy Brock,
Michela Paganini,
Geoffrey Irving,
Oriol Vinyals,
Simon Osindero,
Karen Simonyan
, et al. (3 additional authors not shown)
Abstract:
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to d…
▽ More
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
△ Less
Submitted 7 February, 2022; v1 submitted 8 December, 2021;
originally announced December 2021.
-
Transformation-based Adversarial Video Prediction on Large-Scale Data
Authors:
Pauline Luc,
Aidan Clark,
Sander Dieleman,
Diego de Las Casas,
Yotam Doron,
Albin Cassirer,
Karen Simonyan
Abstract:
Recent breakthroughs in adversarial generative modeling have led to models capable of producing video samples of high quality, even on large and complex datasets of real-world video. In this work, we focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence. We first improve the state of the art by performing…
▽ More
Recent breakthroughs in adversarial generative modeling have led to models capable of producing video samples of high quality, even on large and complex datasets of real-world video. In this work, we focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence. We first improve the state of the art by performing a systematic empirical study of discriminator decompositions and proposing an architecture that yields faster convergence and higher performance than previous approaches. We then analyze recurrent units in the generator, and propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features, and refines it to handle dis-occlusions, scene changes and other complex behavior. We show that this recurrent unit consistently outperforms previous designs. Our final model leads to a leap in the state-of-the-art performance, obtaining a test set Frechet Video Distance of 25.7, down from 69.2, on the large-scale Kinetics-600 dataset.
△ Less
Submitted 17 November, 2021; v1 submitted 9 March, 2020;
originally announced March 2020.
-
DeepMind Control Suite
Authors:
Yuval Tassa,
Yotam Doron,
Alistair Muldal,
Tom Erez,
Yazhe Li,
Diego de Las Casas,
David Budden,
Abbas Abdolmaleki,
Josh Merel,
Andrew Lefrancq,
Timothy Lillicrap,
Martin Riedmiller
Abstract:
The DeepMind Control Suite is a set of continuous control tasks with a standardised structure and interpretable rewards, intended to serve as performance benchmarks for reinforcement learning agents. The tasks are written in Python and powered by the MuJoCo physics engine, making them easy to use and modify. We include benchmarks for several learning algorithms. The Control Suite is publicly avail…
▽ More
The DeepMind Control Suite is a set of continuous control tasks with a standardised structure and interpretable rewards, intended to serve as performance benchmarks for reinforcement learning agents. The tasks are written in Python and powered by the MuJoCo physics engine, making them easy to use and modify. We include benchmarks for several learning algorithms. The Control Suite is publicly available at https://www.github.com/deepmind/dm_control . A video summary of all tasks is available at http://youtu.be/rAai4QzcYbs .
△ Less
Submitted 2 January, 2018;
originally announced January 2018.
-
Dawn of the Selfie Era: The Whos, Wheres, and Hows of Selfies on Instagram
Authors:
Flávio Souza,
Diego de Las Casas,
Vinícius Flores,
SunBum Youn,
Meeyoung Cha,
Daniele Quercia,
Virgílio Almeida
Abstract:
Online interactions are increasingly involving images, especially those containing human faces, which are naturally attention grabbing and more effective at conveying feelings than text. To understand this new convention of digital culture, we study the collective behavior of sharing selfies on Instagram and present how people appear in selfies and which patterns emerge from such interactions. Ana…
▽ More
Online interactions are increasingly involving images, especially those containing human faces, which are naturally attention grabbing and more effective at conveying feelings than text. To understand this new convention of digital culture, we study the collective behavior of sharing selfies on Instagram and present how people appear in selfies and which patterns emerge from such interactions. Analysis of millions of photos shows that the amount of selfies has increased by 900 times from 2012 to 2014. Selfies are an effective medium to grab attention; they generate on average 1.1--3.2 times more likes and comments than other types of content on Instagram. Compared to other content, interactions involving selfies exhibit variations in homophily scores (in terms of age and gender) that suggest they are becoming more widespread. Their style also varies by cultural boundaries in that the average age and majority gender seen in selfies differ from one country to another. We provide explanations of such country-wise variations based on cultural and socioeconomic contexts.
△ Less
Submitted 19 October, 2015;
originally announced October 2015.