Search | arXiv e-print repository

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1110 additional authors not shown)

Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks, achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

Submitted 8 August, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2306.04348 [pdf, other]

Non-Hermitian Topological Magnonics

Authors: Tao Yu, Ji Zou, Bowen Zeng, J. W. Rao, Ke Xia

Abstract: Dissipation in mechanics, optics, acoustics, and electronic circuits is nowadays recognized to be not always detrimental but can be exploited to achieve non-Hermitian topological phases or properties with functionalities for potential device applications. As elementary excitations of ordered magnetic moments that exist in various magnetic materials, magnons are the information carriers in magnonic devices with low-energy consumption for reprogrammable logic, non-reciprocal communication, and non-volatile memory functionalities. Non-Hermitian topological magnonics deals with the engineering of dissipation and/or gain for non-Hermitian topological phases or properties in magnets that are not achievable in the conventional Hermitian scenario, with associated functionalities cross-fertilized with their electronic, acoustic, optic, and mechanic counterparts, such as giant enhancement of magnonic frequency combs, magnon amplification, (quantum) sensing of the magnetic field with unprecedented sensitivity, magnon accumulation, and perfect absorption of microwaves. In this review article, we address the unified approach in constructing magnonic non-Hermitian Hamiltonian, introduce the basic non-Hermitian topological physics, and provide a comprehensive overview of the recent theoretical and experimental progress towards achieving distinct non-Hermitian topological phases or properties in magnonic devices, including exceptional points, exceptional nodal phases, non-Hermitian magnonic SSH model, and non-Hermitian skin effect. We emphasize the non-Hermitian Hamiltonian approach based on the Lindbladian or self-energy of the magnonic subsystem but address the physics beyond it as well, such as the crucial quantum jump effect in the quantum regime and non-Markovian dynamics. We provide a perspective for future opportunities and challenges before concluding this article.

Submitted 9 November, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: 101 pages, 35 figures

arXiv:2304.11948 [pdf]

Perspective on non-Hermitian physics in magnetic systems

Authors: Tao Yu, J. W. Rao

Abstract: A perspective on non-Hermitian physics in magnetic systems is addressed in this short article, including exceptional points, exceptional nodal phases, the non-Hermitian SSH model, and the non-Hermitian skin effect.

Submitted 23 August, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

Comments: 5 pages. Submitted as a section of Magnonic Roadmap 2024

arXiv:2302.08904 [pdf, other]

doi 10.1103/PhysRevLett.130.146702

Coherent Microwave Emission of a Gain-Driven Polariton

Authors: Bimu Yao, Y. S. Gui, J. W. Rao, Y. H. Zhang, Wei Lu, C. -M. Hu

Abstract: By developing a gain-embedded cavity magnonics platform, we create a gain-driven polariton (GDP) that is activated by an amplified electromagnetic field. Distinct effects of gain-driven light-matter interaction, such as polariton auto-oscillations, polariton phase singularity, self-selection of a polariton bright mode, and gain-induced magnon-photon synchronization, are theoretically studied and experimentally manifested. Utilizing the gain-sustained photon coherence of the GDP, we demonstrate polariton-based coherent microwave amplification (~ 40 dB) and achieve high-quality coherent microwave emission (Q > 10^9).

Submitted 15 February, 2023; originally announced February 2023.

Comments: 6 pages, 4 figures

arXiv:2204.04590 [pdf, other]

doi 10.1103/PhysRevLett.130.046705

Unveiling a Pump-Induced Magnon Mode via its Strong Interaction with Walker Modes

Authors: J. W. Rao, Bimu Yao, C. Y. Wang, C. Zhang, Tao Yu, Wei Lu

Abstract: We observe a power-dependent anticrossing of Walker spin-wave modes under microwave pumping when a ferrimagnet is placed in a microwave waveguide that does not support any discrete photon mode. We interpret this unexpected anticrossing as the generation of a pump-induced magnon mode that couples strongly to the Walker modes of the ferrimagnet. This anticrossing inherits an excellent tunability from the pump, which allows us to control the anticrossing via the pump power, frequency, and waveform. Further, we realize a remarkable functionality of this anticrossing, namely, a microwave frequency comb, in terms of the nonlinear interaction that mixes the pump and probe frequencies. Such a frequency comb originates from the magnetic dynamics and thereby does not suffer from the charge noise. The unveiled hybrid magnonics driven away from its equilibrium enriches the utilization of anticrossing for coherent information processing.

Submitted 5 August, 2023; v1 submitted 9 April, 2022; originally announced April 2022.

Comments: 7 pages, 4 figures

arXiv:2203.15556 [pdf, other]

Training Compute-Optimal Large Language Models

Authors: Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

Abstract: We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4$\times$ more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

Submitted 29 March, 2022; originally announced March 2022.
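
Note on the equal-scaling rule in the abstract above: the following is a minimal Python sketch, not the authors' method. It assumes training compute is proportional to the product of parameter count and training tokens, which the abstract does not state. Under that assumption, multiplying the budget by k scales both quantities by sqrt(k). The function name and the starting sizes are hypothetical.

```python
import math


def scale_compute_optimal(params, tokens, compute_multiplier):
    """Scale model size and training tokens equally for a larger budget.

    Assumption (not from the abstract): compute is proportional to
    params * tokens, so each factor grows by sqrt(compute_multiplier).
    """
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    factor = math.sqrt(compute_multiplier)
    return params * factor, tokens * factor


# Hypothetical starting point: 1B parameters trained on 20B tokens.
# Quadrupling the compute budget doubles both quantities.
new_params, new_tokens = scale_compute_optimal(1e9, 20e9, 4.0)
print(f"{new_params:.3g} parameters, {new_tokens:.3g} tokens")
```

Doubling the model size alone would use only half of a 4x budget under this assumption, which is why the token count doubles alongside it.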

arXiv:2112.11446 [pdf, other]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Authors: Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, et al. (55 additional authors not shown)

Abstract: Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally, we discuss the application of language models to AI safety and the mitigation of downstream harms.

Submitted 21 January, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: 120 pages

arXiv:2112.04426 [pdf, other]

Improving language models by retrieving from trillions of tokens

Authors: Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, et al. (3 additional authors not shown)

Abstract: We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.

Submitted 7 February, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: Fix incorrect reported numbers in Table 14

arXiv:2106.03517 [pdf, other]

Top-KAST: Top-K Always Sparse Training

Authors: Siddhant M. Jayakumar, Razvan Pascanu, Jack W. Rae, Simon Osindero, Erich Elsen

Abstract: Sparse neural networks are becoming increasingly important as the field seeks to improve the performance of existing models by scaling them up, while simultaneously trying to reduce power consumption and computational footprint. Unfortunately, most existing methods for inducing performant sparse models still entail the instantiation of dense parameters, or dense gradients in the backward-pass, during training. For very large models this requirement can be prohibitive. In this work we propose Top-KAST, a method that preserves constant sparsity throughout training (in both the forward and backward-passes). We demonstrate the efficacy of our approach by showing that it performs comparably to or better than previous works when training models on the established ImageNet benchmark, whilst fully maintaining sparsity. In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling where the current best performing architectures tend to have tens of billions of parameters and scaling up does not yet seem to have saturated performance. Sparse versions of these architectures can be run with significantly fewer resources, making them more widely accessible and applicable. Furthermore, in addition to being effective, our approach is straightforward and can easily be implemented in a wide range of existing machine learning frameworks with only a few additional lines of code. We therefore hope that our contribution will help enable the broader community to explore the potential held by massive models, without incurring massive computational cost.

Submitted 7 June, 2021; originally announced June 2021.

Journal ref: Advances in Neural Information Processing Systems, 33, 20744-20754

arXiv:2009.03950 [pdf, other]

doi 10.1103/PhysRevLett.125.147202

Unconventional Singularity in Anti-Parity-Time Symmetric Cavity Magnonics

Authors: Y. Yang, Yi-Pu Wang, J. W. Rao, Y. S. Gui, B. M. Yao, W. Lu, C. -M. Hu

Abstract: By engineering an anti-parity-time (anti-PT) symmetric cavity magnonics system with precise eigenspace controllability, we observe two different singularities in the same system. One type of singularity, the exceptional point (EP), is produced by tuning the magnon damping. Between two EPs, the maximal coherent superposition of photon and magnon states is robustly sustained by the preserved anti-PT symmetry. The other type of singularity, arising from the dissipative coupling of two anti-resonances, is an unconventional bound state in the continuum (BIC). At the settings of BICs, the coupled system exhibits infinite discontinuities in the group delay. We find that both singularities co-exist at the equator of the Bloch sphere, which reveals a unique hybrid state that simultaneously exhibits the maximal coherent superposition and slow light capability.

Submitted 8 September, 2020; originally announced September 2020.

Comments: 6 pages, 4 figures

Journal ref: Phys. Rev. Lett. 125, 147202 (2020)

arXiv:2007.03356 [pdf, other]

Do Transformers Need Deep Long-Range Memory

Authors: Jack W. Rae, Ali Razavi

Abstract: Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which renders its state thousands of times larger than RNN predecessors. However, it is unclear whether this is necessary. We perform a set of interventions to show that comparable performance can be obtained with 6X fewer long range memories and better performance can be obtained by limiting the range of attention in lower layers of the network.

Submitted 7 July, 2020; originally announced July 2020.

Comments: published at 58th Annual Meeting of the Association for Computational Linguistics. 6 pages, 4 figures, 1 table

arXiv:2005.04161 [pdf, other]

doi 10.1103/PhysRevB.102.014453

Electrical detection of magnon-photon interaction via auxiliary spin wave mode

Authors: Peng-Chao Xu, J. W. Rao, Y. Wang, Y. S. Gui, John Q. Xiao, Xiaofeng Jin, C. -M. Hu

Abstract: We report on the electrical detection of a hybrid magnon-photon system, which is comprised of a magnetic sample coupled to a planar cavity. While the uniform Kittel mode has the largest coupling strength among all the magnon modes, it only generates a modest voltage signal by means of the inverse spin-Hall effect. We have found that the generated voltage can be significantly enhanced by introducing a higher order magnon mode, which possesses a much higher spin pumping efficiency and, furthermore, is nearly degenerate with the Kittel mode. The experimental results can be explained by our theoretical model, and suggest that the use of an auxiliary magnon mode can realize the configuration of a magnon-photon system with both strong coupling and large spin current.

Submitted 8 May, 2020; originally announced May 2020.

Journal ref: Phys. Rev. B 102, 014453 (2020)

arXiv:1912.05478 [pdf, other]

doi 10.1103/PhysRevB.101.064404

Travelling photons mediated interactions between a magnon mode and a cavity photon mode

Authors: J. W. Rao, Y. P. Wang, Y. Yang, T. Yu, Y. S. Gui, X. L. Fan, D. S. Xue, C. -M. Hu

Abstract: We systematically study the indirect interaction between a magnon mode and a cavity photon mode mediated by travelling photons of a waveguide. From a general Hamiltonian, we derive the effective coupling strength between two separated modes, and obtain the theoretical expression of the system's transmission. Accordingly, we design an experimental set-up consisting of a shield cavity photon mode, microstrip line and a magnon system to test our theoretical predictions. From measured transmission spectra, indirect interaction, as well as mode hybridization, between two modes can be observed. All experimental observations support our theoretical predictions. In this work, we clarify the mechanism of travelling photon mediated interactions between two separate modes. Even without spatial mode overlap, two separated modes can still couple with each other through their correlated dissipations into a mutual travelling photon bus. This conclusion may help us understand the recently discovered dissipative coupling effect in cavity magnonics systems. Additionally, the physics and technique developed in this work may benefit us in designing new hybrid systems based on the waveguide magnonics.

Submitted 11 December, 2019; originally announced December 2019.

Comments: 6 pages and 4 figures

Journal ref: Phys. Rev. B 101, 064404 (2020)

arXiv:1911.05507 [pdf, other]

Compressive Transformers for Long-Range Sequence Modelling

Authors: Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap

Abstract: We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.

Submitted 13 November, 2019; originally announced November 2019.

Comments: 19 pages, 6 figures, 10 tables

arXiv:1910.06764 [pdf, other]

Stabilizing Transformers for Reinforcement Learning

Authors: Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M. Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M. Botvinick, Nicolas Heess, Raia Hadsell

Abstract: Owing to their ability to both effectively integrate information over long time horizons and scale to massive amounts of data, self-attention architectures have recently shown breakthrough success in natural language processing (NLP), achieving state-of-the-art results in domains such as language modeling and machine translation. Harnessing the transformer's ability to process long time horizons of information could provide a similar performance boost in partially observable reinforcement learning (RL) domains, but the large-scale transformers used in NLP have yet to be successfully applied to the RL setting. In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. We propose architectural modifications that substantially improve the stability and learning speed of the original Transformer and XL variant. The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture. We show that the GTrXL, trained using the same losses, has stability and performance that consistently matches or exceeds a competitive LSTM baseline, including on more reactive tasks where memory is less critical. GTrXL offers an easy-to-train, simple-to-implement but substantially more expressive architectural alternative to the standard multi-layer LSTM ubiquitously used for RL agents in partially observable environments.

Submitted 13 October, 2019; originally announced October 2019.

arXiv:1910.02720 [pdf, other]

Meta-Learning Deep Energy-Based Memory Models

Authors: Sergey Bartunov, Jack W Rae, Simon Osindero, Timothy P Lillicrap

Abstract: We study the problem of learning associative memory -- a system which is able to retrieve a remembered pattern based on its distorted or incomplete version. Attractor networks provide a sound model of associative memory: patterns are stored as attractors of the network dynamics and associative retrieval is performed by running the dynamics starting from a query pattern until it converges to an attractor. In such models the dynamics are often implemented as an optimization procedure that minimizes an energy function, such as in the classical Hopfield network. In general it is difficult to derive a writing rule for a given dynamics and energy that is both compressive and fast. Thus, most research in energy-based memory has been limited either to tractable energy models not expressive enough to handle complex high-dimensional objects such as natural images, or to models that do not offer fast writing. We present a novel meta-learning approach to energy-based memory models (EBMM) that allows one to use an arbitrary neural architecture as an energy model and quickly store patterns in its weights. We demonstrate experimentally that our EBMM approach can build compressed memories for synthetic and natural data, and is capable of associative retrieval that outperforms existing memory systems in terms of the reconstruction error and compression rate.

Submitted 20 April, 2021; v1 submitted 7 October, 2019; originally announced October 2019.

Comments: ICLR 2020

arXiv:1909.12238 [pdf, other]

V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

Authors: H. Francis Song, Abbas Abdolmaleki, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick

Abstract: Some of the most successful applications of deep reinforcement learning to challenging domains in discrete and continuous control have used policy gradient methods in the on-policy setting. However, policy gradients can suffer from large variance that may limit performance, and in practice require carefully tuned entropy regularization to prevent policy collapse. As an alternative to policy gradient algorithms, we introduce V-MPO, an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) that performs policy iteration based on a learned state-value function. We show that V-MPO surpasses previously reported scores for both the Atari-57 and DMLab-30 benchmark suites in the multi-task setting, and does so reliably without importance weighting, entropy regularization, or population-based tuning of hyperparameters. On individual DMLab and Atari levels, the proposed algorithm can achieve scores that are substantially higher than has previously been reported. V-MPO is also applicable to problems with high-dimensional, continuous action spaces, which we demonstrate in the context of learning to control simulated humanoids with 22 degrees of freedom from full state observations and 56 degrees of freedom from pixel observations, as well as example OpenAI Gym tasks where V-MPO achieves substantially higher asymptotic scores than previously reported.

Submitted 26 September, 2019; originally announced September 2019.

Comments: * equal contribution

arXiv:1908.07907 [pdf, other]

doi 10.1103/PhysRevLett.123.127202

Nonreciprocity and Unidirectional Invisibility in Cavity Magnonics

Authors: Yi-Pu Wang, J. W. Rao, Y. Yang, Peng-Chao Xu, Y. S. Gui, B. M. Yao, J. Q. You, C. -M. Hu

Abstract: We reveal the cooperative effect of coherent and dissipative magnon-photon couplings in an open cavity magnonic system, which leads to nonreciprocity with a considerably large isolation ratio and flexible controllability. Furthermore, we discover unidirectional invisibility for microwave propagation, which appears at the zero-damping condition for hybrid magnon-photon modes. A simple model is developed to capture the generic physics of the interference between coherent and dissipative couplings, which accurately reproduces the observations over a broad range of parameters. This general scheme could inspire methods to achieve nonreciprocity in other systems.

Submitted 21 August, 2019; originally announced August 2019.

Comments: 6 pages, 4 figures

arXiv:1907.06783 [pdf, other]

doi 10.1103/PhysRevB.100.094415

Cavity mediated dissipative coupling of distant magnetic moments: theory and experiment

Authors: Peng-Chao Xu, J. W. Rao, Y. S. Gui, Xiaofeng Jin, C. -M. Hu

Abstract: We investigate long-range coherent and dissipative coupling between two spatially separated magnets while both are coupled to a microwave cavity. A careful examination of the system shows that the indirect interaction between two magnon modes is dependent on their individual mechanisms of direct coupling to the cavity. If both magnon modes share the same form of coupling to the cavity (either coherent or dissipative), then the indirect coupling between them will produce level repulsion. Conversely, if the magnon modes have different forms of coupling to the cavity (one coherent and one dissipative), then their indirect coupling will produce level attraction. We further demonstrate the cavity-mediated nature of the indirect interaction by investigating the dependence of the indirect coupling strength on the frequency detuning between the magnon and cavity modes. Our work theoretically and experimentally explores indirect cavity-mediated interactions in systems exhibiting both coherent and dissipative coupling, which opens a new avenue for controlling and utilizing light-matter interactions.

Submitted 15 July, 2019; originally announced July 2019.

Journal ref: Phys. Rev. B 100, 094415 (2019)

arXiv:1906.04304 [pdf, other]

Meta-Learning Neural Bloom Filters

Authors: Jack W Rae, Sergey Bartunov, Timothy P Lillicrap

Abstract: There has been a recent trend in training neural networks to replace data structures that have been crafted by hand, with an aim for faster execution, better accuracy, or greater compression. In this setting, a neural data structure is instantiated by training a network over many epochs of its inputs until convergence. In applications where inputs arrive at high throughput, or are ephemeral, training a network from scratch is not practical. This motivates the need for few-shot neural data structures. In this paper we explore the learning of approximate set membership over a set of data in one-shot via meta-learning. We propose a novel memory architecture, the Neural Bloom Filter, which is able to achieve significant compression gains over classical Bloom Filters and existing memory-augmented neural networks.

Submitted 10 June, 2019; originally announced June 2019.

Comments: International Conference on Machine Learning 2019

arXiv:1902.06795 [pdf, other]

Coherent control of magnon radiative damping with local photon states

Authors: B. M. Yao, T. Yu, Y. S. Gui, J. W. Rao, Y. T. Zhao, W. Lu, C. -M. Hu

Abstract: The collective excitation of ordered spins, known as spin waves or magnons, can in principle radiate by emitting travelling photons to an open system when decaying to the ground state. However, in contrast to electric dipoles, magnetic dipoles contributed by magnons are more isolated from the electromagnetic environment, with negligible radiation in the vacuum, limiting their application in coherent communication by photons. Recently, strong interaction between cavity standing-wave photons and magnons has been reported, indicating the possible manipulation of magnon radiation via tailoring photon states. Here, by loading an yttrium iron garnet sphere in a one-dimensional circular waveguide cavity in the presence of both travelling and standing photon modes, we demonstrate efficient photon emission from magnons and significant magnon radiative damping, with the radiation rate found to be proportional to the local density of states (LDOS) of photons. By modulating the LDOS, including its magnitude and/or polarization, we can flexibly tune the photon emission and magnon radiative damping on demand. Our findings provide a general way of manipulating photon emission from magnon radiation for harnessing energy and angular momentum generation, transfer and storage modulated by magnons in cavity and waveguide electrodynamics.

Submitted 10 September, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

Comments: 9+7 pages, 4+2 figures

arXiv:1901.07633 [pdf, other]

doi 10.1103/PhysRevApplied.11.054023

Control of the magnon-photon level attraction in a planar cavity

Authors: Y. Yang, J. W. Rao, Y. S. Gui, B. M. Yao, W. Lu, C. -M. Hu

Abstract: A resistive coupling circuit is used to model the recently discovered dissipative coupling in a hybridized cavity photon-magnon system. With this model as a basis we have designed a planar cavity in which a controllable transition between level attraction and level repulsion can be achieved. This behaviour can be quantitatively understood using an LCR circuit model with a complex coupling strength. Our work therefore develops and verifies a circuit method to model level repulsion and level attraction and confirms the universality of dissipative coupling in the cavity photon-magnon system. The realization of both coherent and dissipative couplings in a planar cavity may provide new avenues for the design and adaptation of dissipatively coupled systems for practical applications in information processing.

Submitted 8 April, 2019; v1 submitted 22 January, 2019; originally announced January 2019.

Comments: 7 pages, 4 figures and 2 additional figures in the appendix

Journal ref: Phys. Rev. Applied 11, 054023 (2019)

arXiv:1809.01233 [pdf, other]

doi 10.1103/PhysRevLett.121.137203

Level Attraction Due to Dissipative Magnon-Photon Coupling

Authors: M. Harder, Y. Yang, B. M. Yao, C. H. Yu, J. W. Rao, Y. S. Gui, R. L. Stamps, C. -M. Hu

Abstract: We report dissipative magnon-photon coupling caused by cavity Lenz effect, where the magnons in a magnet induce a rf current in the cavity, leading to a cavity back action that impedes the magnetization dynamics. This effect is revealed in our experiment as level attraction with a coalescence of hybridized magnon-photon modes, which is distinctly different from level repulsion with mode anticrossing caused by coherent magnon-photon coupling. We develop a method to control the interpolation of coherent and dissipative magnon-photon coupling, and observe a matching condition where the two effects cancel. Our work sheds light on the so-far hidden side of magnon-photon coupling, opening a new avenue for controlling and utilizing light-matter interactions.

Submitted 4 September, 2018; originally announced September 2018.

Comments: 5 pages, 4 figures

arXiv:1803.10049 [pdf, other]

Fast Parametric Learning with Activation Memorization

Authors: Jack W Rae, Chris Dyer, Peter Dayan, Timothy P Lillicrap

Abstract: Neural networks trained with backpropagation often struggle to identify classes that have been observed a small number of times. In applications where most class labels are rare, such as language modelling, this can become a performance bottleneck. One potential remedy is to augment the network with a fast-learning non-parametric model which stores recent activations and class labels into an external memory. We explore a simplified architecture where we treat a subset of the model parameters as fast memory stores. This can help retain information over longer time intervals than a traditional memory, and does not require additional space or compute. In the case of image classification, we display faster binding of novel classes on an Omniglot image curriculum task. We also show improved performance for word-based language models on news reports (GigaWord), books (Project Gutenberg) and Wikipedia articles (WikiText-103) --- the latter achieving a state-of-the-art perplexity of 29.2.

Submitted 27 March, 2018; originally announced March 2018.

arXiv:1802.10542 [pdf, other]

Memory-based Parameter Adaptation

Authors: Pablo Sprechmann, Siddhant M. Jayakumar, Jack W. Rae, Alexander Pritzel, Adrià Puigdomènech Badia, Benigno Uria, Oriol Vinyals, Demis Hassabis, Razvan Pascanu, Charles Blundell

Abstract: Deep neural networks have excelled on a wide range of problems, from vision to language and game playing. Neural networks very gradually incorporate information into weights as they process data, requiring very low learning rates. If the training distribution shifts, the network is slow to adapt, and when it does adapt, it typically performs badly on the training distribution before the shift. Our method, Memory-based Parameter Adaptation, stores examples in memory and then uses a context-based lookup to directly modify the weights of a neural network. Much higher learning rates can be used for this local adaptation, negating the need for many iterations over similar data before good predictions can be made. As our method is memory-based, it alleviates several shortcomings of neural networks, such as catastrophic forgetting, fast, stable acquisition of new knowledge, learning with imbalanced class labels, and fast learning during evaluation. We demonstrate this on a range of supervised tasks: large-scale image classification and language modelling.

Submitted 28 February, 2018; originally announced February 2018.

Comments: Published as a conference paper at ICLR 2018

arXiv:1610.09027 [pdf, other]

Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes

Authors: Jack W Rae, Jonathan J Hunt, Tim Harley, Ivo Danihelka, Andrew Senior, Greg Wayne, Alex Graves, Timothy P Lillicrap

Abstract: Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows --- limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs $1,\!000\times$ faster and with $3,\!000\times$ less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring $100,\!000$s of time steps and memories. As well, we show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.

Submitted 27 October, 2016; originally announced October 2016.

Comments: in 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain

Showing 1–27 of 27 results for author: Rae, J W