-
Gemma 2: Improving Open Language Models at a Practical Size
Authors:
Gemma Team,
Morgane Riviere,
Shreya Pathak,
Pier Giuseppe Sessa,
Cassidy Hardin,
Surya Bhupatiraju,
Léonard Hussenot,
Thomas Mesnard,
Bobak Shahriari,
Alexandre Ramé,
Johan Ferret,
Peter Liu,
Pouya Tafti,
Abe Friesen,
Michelle Casbon,
Sabela Ramos,
Ravin Kumar,
Charline Le Lan,
Sammy Jerome,
Anton Tsitsulin,
Nino Vieillard,
Piotr Stanczyk,
Sertan Girgin,
Nikola Momchev,
Matt Hoffman
, et al. (172 additional authors not shown)
Abstract:
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
Submitted 2 August, 2024; v1 submitted 31 July, 2024;
originally announced August 2024.
-
u-$μ$P: The Unit-Scaled Maximal Update Parametrization
Authors:
Charlie Blake,
Constantin Eichenberg,
Josef Dean,
Lukas Balles,
Luke Y. Prince,
Björn Deiseroth,
Andres Felipe Cruz-Salinas,
Carlo Luschi,
Samuel Weinbach,
Douglas Orr
Abstract:
The Maximal Update Parametrization ($μ$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$μ$P, which improves upon $μ$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: $μ$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$μ$P models reaching a lower loss than comparable $μ$P models and working out-of-the-box in FP8.
Submitted 24 July, 2024;
originally announced July 2024.
-
Capabilities of Gemini Models in Medicine
Authors:
Khaled Saab,
Tao Tu,
Wei-Hung Weng,
Ryutaro Tanno,
David Stutz,
Ellery Wulczyn,
Fan Zhang,
Tim Strother,
Chunjong Park,
Elahe Vedadi,
Juanma Zambrano Chaves,
Szu-Yeu Hu,
Mike Schaekermann,
Aishwarya Kamath,
Yong Cheng,
David G. T. Barrett,
Cathy Cheung,
Basil Mustafa,
Anil Palepu,
Daniel McDuff,
Le Hou,
Tomer Golany,
Luyang Liu,
Jean-baptiste Alayrac,
Neil Houlsby
, et al. (42 additional authors not shown)
Abstract:
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
Submitted 1 May, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
Gemma: Open Models Based on Gemini Research and Technology
Authors:
Gemma Team,
Thomas Mesnard,
Cassidy Hardin,
Robert Dadashi,
Surya Bhupatiraju,
Shreya Pathak,
Laurent Sifre,
Morgane Rivière,
Mihir Sanjay Kale,
Juliette Love,
Pouya Tafti,
Léonard Hussenot,
Pier Giuseppe Sessa,
Aakanksha Chowdhery,
Adam Roberts,
Aditya Barua,
Alex Botev,
Alex Castro-Ros,
Ambrose Slone,
Amélie Héliou,
Andrea Tacchetti,
Anna Bulanova,
Antonia Paterson,
Beth Tsai,
Bobak Shahriari
, et al. (83 additional authors not shown)
Abstract:
This work introduces Gemma, a family of lightweight, state-of-the-art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations.
Submitted 16 April, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1110 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 8 August, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
AI and the Opportunity for Shared Prosperity: Lessons from the History of Technology and the Economy
Authors:
Guy Ben-Ishai,
Jeff Dean,
James Manyika,
Ruth Porat,
Hal Varian,
Kent Walker
Abstract:
Recent progress in artificial intelligence (AI) marks a pivotal moment in human history. It presents the opportunity for machines to learn, adapt, and perform tasks that have the potential to assist people, from everyday activities to their most creative and ambitious projects. It also has the potential to help businesses and organizations harness knowledge, increase productivity, innovate, transform, and power shared prosperity. This tremendous potential raises two fundamental questions: (1) Will AI actually advance national and global economic transformation to benefit society at large? and (2) What issues must we get right to fully realize AI's economic value, expand prosperity and improve lives everywhere? We explore these questions by considering the recent history of technology and innovation as a guide for the likely impact of AI and what we must do to realize its economic potential to benefit society. While we do not presume the future will be entirely like the past, for reasons we will discuss, we do believe prior experience with technological change offers many useful lessons. We conclude that while progress in AI presents a historic opportunity to advance our economic prosperity and future wellbeing, its economic benefits will not come automatically and that AI risks exacerbating existing economic challenges unless we collectively and purposefully act to enable its potential and address its challenges. We suggest a collective policy agenda - involving developers, deployers and users of AI, infrastructure providers, policymakers, and those involved in workforce training - that may help both realize and harness AI's economic potential and address its risks to our shared prosperity.
Submitted 1 February, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Generating QM1B with PySCF$_{\text{IPU}}$
Authors:
Alexander Mathiasen,
Hatem Helal,
Kerstin Klaser,
Paul Balanca,
Josef Dean,
Carlo Luschi,
Dominique Beaini,
Andrew Fitzgibbon,
Dominic Masters
Abstract:
The emergence of foundation models in Computer Vision and Natural Language Processing has resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$ using Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance by simply increasing the amount of training data without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on GitHub: http://github.com/graphcore-research/pyscf-ipu
Submitted 2 November, 2023;
originally announced November 2023.
-
Single-Mode Squeezed Light Generation and Tomography with an Integrated Optical Parametric Oscillator
Authors:
Taewon Park,
Hubert S. Stokowski,
Vahid Ansari,
Samuel Gyger,
Kevin K. S. Multani,
Oguz Tolga Celik,
Alexander Y. Hwang,
Devin J. Dean,
Felix M. Mayor,
Timothy P. McKenna,
Martin M. Fejer,
Amir H. Safavi-Naeini
Abstract:
Quantum optical technologies promise advances in sensing, computing, and communication. A key resource is squeezed light, where quantum noise is redistributed between optical quadratures. We introduce a monolithic, chip-scale platform that exploits the $χ^{(2)}$ nonlinearity of a thin-film lithium niobate (TFLN) resonator device to efficiently generate squeezed states of light. Our system integrates all essential components -- except for the laser and two detectors -- on a single chip with an area of one square centimeter, significantly reducing the size, operational complexity, and power consumption associated with conventional setups. Our work addresses challenges that have limited previous integrated nonlinear photonic implementations that rely on either $χ^{(3)}$ nonlinear resonators or on integrated waveguide $χ^{(2)}$ parametric amplifiers. Using the balanced homodyne measurement subsystem that we implemented on the same chip, we measure a squeezing of 0.55 dB and an anti-squeezing of 1.55 dB. We use 20 mW of input power to generate the parametric oscillator pump field by employing second harmonic generation on the same chip. Our work represents a substantial step toward compact and efficient quantum optical systems poised to leverage the rapid advances in integrated nonlinear and quantum photonics.
Submitted 19 October, 2023;
originally announced October 2023.
-
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Authors:
Dominique Beaini,
Shenyang Huang,
Joao Alex Cunha,
Zhiyi Li,
Gabriela Moisescu-Pareja,
Oleksandr Dymov,
Samuel Maddrell-Mander,
Callum McLean,
Frederik Wenkel,
Luis Müller,
Jama Hussein Mohamud,
Ali Parviz,
Michael Craig,
Michał Koziarski,
Jiarui Lu,
Zhaocheng Zhu,
Cristian Gabellini,
Kerstin Klaser,
Josef Dean,
Cas Wognum,
Maciej Sypetkowski,
Guillaume Rabusseau,
Reihaneh Rabbany,
Jian Tang,
Christopher Morris
, et al. (10 additional authors not shown)
Abstract:
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets shows improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.
Submitted 18 October, 2023; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Arbitrary electro-optic bandwidth and frequency control in lithium niobate optical resonators
Authors:
Jason F. Herrmann,
Devin J. Dean,
Christopher J. Sarabalis,
Vahid Ansari,
Kevin Multani,
E. Alex Wollack,
Timothy P. McKenna,
Jeremy D. Witmer,
Amir H. Safavi-Naeini
Abstract:
In situ tunable photonic filters and memories are important for emerging quantum and classical optics technologies. However, most photonic devices have fixed resonances and bandwidths determined at the time of fabrication. Here we present an in situ tunable optical resonator on thin-film lithium niobate. By leveraging the linear electro-optic effect, we demonstrate widely tunable control over resonator frequency and bandwidth on two different devices. We observe up to $\sim50\times$ tuning in the bandwidth over $\sim50$ V with linear frequency control of $\sim230$ MHz/V. We also develop a closed-form model predicting the tuning behavior of the device. This paves the way for rapid phase and amplitude control over light transmitted through our device.
Submitted 31 July, 2023;
originally announced July 2023.
-
Integrated frequency-modulated optical parametric oscillator
Authors:
Hubert S. Stokowski,
Devin J. Dean,
Alexander Y. Hwang,
Taewon Park,
Oguz Tolga Celik,
Marc Jankowski,
Carsten Langrock,
Vahid Ansari,
Martin M. Fejer,
Amir H. Safavi-Naeini
Abstract:
Optical frequency combs have revolutionized precision measurement, time-keeping, and molecular spectroscopy. A substantial effort has developed around "microcombs": integrating comb-generating technologies into compact, reliable photonic platforms. Current approaches for generating these microcombs involve either the electro-optic (EO) or Kerr mechanisms. Despite rapid progress, maintaining high efficiency and wide bandwidth remains challenging. Here, we introduce a new class of microcomb -- an integrated optical frequency comb generator that combines electro-optics and parametric amplification to yield a frequency-modulated optical parametric oscillator (FM-OPO). In stark contrast to EO and Kerr combs, the FM-OPO microcomb does not form pulses but maintains operational simplicity and highly efficient pump power utilization with an output resembling a frequency-modulated laser. We outline the working principles of FM-OPO and demonstrate them by fabricating the complete optical system in thin-film lithium niobate (LNOI). We measure pump to comb internal conversion efficiency exceeding 93% (34% out-coupled) over a nearly flat-top spectral distribution spanning approximately 1,000 modes (approximately 6 THz). Compared to an EO comb, the cavity dispersion rather than loss determines the FM-OPO bandwidth, enabling broadband combs with a smaller RF modulation power. The FM-OPO microcomb, with its robust operational dynamics, high efficiency, and large bandwidth, contributes a new approach to the field of microcombs and promises to herald an era of miniaturized precision measurement, and spectroscopy tools to accelerate advancements in metrology, spectroscopy, telecommunications, sensing, and computing.
Submitted 9 July, 2023;
originally announced July 2023.
-
Many-objective Optimization via Voting for Elites
Authors:
Jackson Dean,
Nick Cheney
Abstract:
Real-world problems are often comprised of many objectives and require solutions that carefully trade off between them. Current approaches to many-objective optimization often require challenging assumptions, like knowledge of the importance/difficulty of objectives in a weighted-sum single-objective paradigm, or enormous populations to overcome the curse of dimensionality in multi-objective Pareto optimization. Combining elements from Many-Objective Evolutionary Algorithms and Quality Diversity algorithms like MAP-Elites, we propose Many-objective Optimization via Voting for Elites (MOVE). MOVE maintains a map of elites that perform well on different subsets of the objective functions. On a 14-objective image-neuroevolution problem, we demonstrate that MOVE is viable with a population of as few as 50 elites and outperforms a naive single-objective baseline. We find that the algorithm's performance relies on solutions jumping across bins (for a parent to produce a child that is elite for a different subset of objectives). We suggest that this type of goal-switching is an implicit method for automatic identification of stepping stones or curriculum learning. We comment on the similarities and differences between MOVE and MAP-Elites, hoping to provide insight to aid in the understanding of that approach, and suggest future work that may inform this approach's use for many-objective problems in general.
Submitted 5 July, 2023;
originally announced July 2023.
-
Strong Interaction Physics at the Luminosity Frontier with 22 GeV Electrons at Jefferson Lab
Authors:
A. Accardi,
P. Achenbach,
D. Adhikari,
A. Afanasev,
C. S. Akondi,
N. Akopov,
M. Albaladejo,
H. Albataineh,
M. Albrecht,
B. Almeida-Zamora,
M. Amaryan,
D. Androić,
W. Armstrong,
D. S. Armstrong,
M. Arratia,
J. Arrington,
A. Asaturyan,
A. Austregesilo,
H. Avagyan,
T. Averett,
C. Ayerbe Gayoso,
A. Bacchetta,
A. B. Balantekin,
N. Baltzell,
L. Barion
, et al. (419 additional authors not shown)
Abstract:
This document presents the initial scientific case for upgrading the Continuous Electron Beam Accelerator Facility (CEBAF) at Jefferson Lab (JLab) to 22 GeV. It is the result of a community effort, incorporating insights from a series of workshops conducted between March 2022 and April 2023. With a track record of over 25 years in delivering the world's most intense and precise multi-GeV electron beams, CEBAF's potential for a higher energy upgrade presents a unique opportunity for an innovative nuclear physics program, which seamlessly integrates a rich historical background with a promising future. The proposed physics program encompasses a diverse range of investigations centered around the nonperturbative dynamics inherent in hadron structure and the exploration of strongly interacting systems. It builds upon the exceptional capabilities of CEBAF in high-luminosity operations, the availability of existing or planned Hall equipment, and recent advancements in accelerator technology. The proposed program covers various scientific topics, including Hadron Spectroscopy, Partonic Structure and Spin, Hadronization and Transverse Momentum, Spatial Structure, Mechanical Properties, Form Factors and Emergent Hadron Mass, Hadron-Quark Transition, and Nuclear Dynamics at Extreme Conditions, as well as QCD Confinement and Fundamental Symmetries. Each topic highlights the key measurements achievable at a 22 GeV CEBAF accelerator. Furthermore, this document outlines the significant physics outcomes and unique aspects of these programs that distinguish them from other existing or planned facilities. In summary, this document provides an exciting rationale for the energy upgrade of CEBAF to 22 GeV, outlining the transformative scientific potential that lies within reach, and the remarkable opportunities it offers for advancing our understanding of hadron physics and related fundamental phenomena.
Submitted 24 August, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Design and analysis of an exactly divergence-free hybridized discontinuous Galerkin method for incompressible flows on meshes with quadrilateral cells
Authors:
Joseph P. Dean,
Sander Rhebergen,
Garth N. Wells
Abstract:
We generalise a hybridized discontinuous Galerkin method for incompressible flow problems to non-affine cells, showing that with a suitable element mapping the generalised method preserves a key invariance property that eludes most methods, namely that any irrotational component of the prescribed force is exactly balanced by the pressure gradient and does not affect the velocity field. This invariance property can be preserved in the discrete problem if the incompressibility constraint is satisfied in a sufficiently strong sense. We derive sufficient conditions to guarantee discretely divergence-free functions are exactly divergence-free and give examples of divergence-free finite elements on meshes with triangular, quadrilateral, tetrahedral, or hexahedral cells generated by a (possibly non-affine) map from their respective reference cells. In the case of quadrilateral cells, we prove an optimal error estimate for the velocity field that does not depend on the pressure approximation. Our analysis is supported by numerical results.
Submitted 26 September, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
Brainformers: Trading Simplicity for Efficiency
Authors:
Yanqi Zhou,
Nan Du,
Yanping Huang,
Daiyi Peng,
Chang Lan,
Da Huang,
Siamak Shakeri,
David So,
Andrew Dai,
Yifeng Lu,
Zhifeng Chen,
Quoc Le,
Claire Cui,
James Laudon,
Jeff Dean
Abstract:
Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.
Submitted 25 April, 2024; v1 submitted 29 May, 2023;
originally announced June 2023.
-
The Present and Future of QCD
Authors:
P. Achenbach,
D. Adhikari,
A. Afanasev,
F. Afzal,
C. A. Aidala,
A. Al-bataineh,
D. K. Almaalol,
M. Amaryan,
D. Androić,
W. R. Armstrong,
M. Arratia,
J. Arrington,
A. Asaturyan,
E. C. Aschenauer,
H. Atac,
H. Avakian,
T. Averett,
C. Ayerbe Gayoso,
X. Bai,
K. N. Barish,
N. Barnea,
G. Basar,
M. Battaglieri,
A. A. Baty,
I. Bautista
, et al. (378 additional authors not shown)
Abstract:
This White Paper presents the community inputs and scientific conclusions from the Hot and Cold QCD Town Meeting that took place September 23-25, 2022 at MIT, as part of the Nuclear Science Advisory Committee (NSAC) 2023 Long Range Planning process. A total of 424 physicists registered for the meeting. The meeting highlighted progress in Quantum Chromodynamics (QCD) nuclear physics since the 2015 LRP (LRP15) and identified key questions and plausible paths to obtaining answers to those questions, defining priorities for our research over the coming decade. In defining the priority of outstanding physics opportunities for the future, both prospects for the short (~ 5 years) and longer term (5-10 years and beyond) are identified together with the facilities, personnel and other resources needed to maximize the discovery potential and maintain United States leadership in QCD physics worldwide. This White Paper is organized as follows: In the Executive Summary, we detail the Recommendations and Initiatives that were presented and discussed at the Town Meeting, and their supporting rationales. Section 2 highlights major progress and accomplishments of the past seven years. It is followed, in Section 3, by an overview of the physics opportunities for the immediate future, and in relation with the next QCD frontier: the EIC. Section 4 provides an overview of the physics motivations and goals associated with the EIC. Section 5 is devoted to the workforce development and support of diversity, equity and inclusion. This is followed by a dedicated section on computing in Section 6. Section 7 describes the national need for nuclear data science and the relevance to QCD research.
Submitted 4 March, 2023;
originally announced March 2023.
-
GPS++: Reviving the Art of Message Passing for Molecular Property Prediction
Authors:
Dominic Masters,
Josef Dean,
Kerstin Klaser,
Zhiyi Li,
Sam Maddrell-Mander,
Adam Sanders,
Hatem Helal,
Deniz Beker,
Andrew Fitzgibbon,
Shenyang Huang,
Ladislav Rampášek,
Dominique Beaini
Abstract:
We present GPS++, a hybrid Message Passing Neural Network / Graph Transformer model for molecular property prediction. Our model integrates a well-tuned local message passing component and biased global attention with other key ideas from prior literature to achieve state-of-the-art results on the large-scale molecular dataset PCQM4Mv2. Through a thorough ablation study we highlight the impact of individual components and find that nearly all of the model's performance can be maintained without any use of global self-attention, showing that message passing is still a competitive approach for 3D molecular property prediction despite the recent dominance of graph transformers. We also find that our approach is significantly more accurate than prior art when 3D positional information is not available.
Submitted 12 May, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Integrated Quantum Optical Phase Sensor
Authors:
Hubert S. Stokowski,
Timothy P. McKenna,
Taewon Park,
Alexander Y. Hwang,
Devin J. Dean,
Oguz Tolga Celik,
Vahid Ansari,
Martin M. Fejer,
Amir H. Safavi-Naeini
Abstract:
The quantum noise of light fundamentally limits optical phase sensors. A semiclassical picture attributes this noise to the random arrival time of photons from a coherent light source such as a laser. An engineered source of squeezed states suppresses this noise and allows sensitivity beyond the standard quantum limit (SQL) for phase detection. Advanced gravitational wave detectors like LIGO have already incorporated such sources, and nascent efforts in realizing quantum biological measurements have provided glimpses into new capabilities emerging in quantum measurement. We need ways to engineer and use quantum light within deployable quantum sensors that operate outside the confines of a lab environment. Here we present a photonic integrated circuit fabricated in thin-film lithium niobate that provides a path to meet these requirements. We use the second-order nonlinearity to produce a squeezed state at the same frequency as the pump light and realize circuit control and sensing with electro-optics. Using 26.2 milliwatts of optical power, we measure (2.7 $\pm$ 0.2)$\%$ squeezing and apply it to increase the signal-to-noise ratio of phase measurement. We anticipate that on-chip photonic systems like this, which operate with low power and integrate all of the needed functionality on a single die, will open new opportunities for quantum optical sensing.
Submitted 19 December, 2022;
originally announced December 2022.
-
GPS++: An Optimised Hybrid MPNN/Transformer for Molecular Property Prediction
Authors:
Dominic Masters,
Josef Dean,
Kerstin Klaser,
Zhiyi Li,
Sam Maddrell-Mander,
Adam Sanders,
Hatem Helal,
Deniz Beker,
Ladislav Rampášek,
Dominique Beaini
Abstract:
This technical report presents GPS++, the first-place solution to the Open Graph Benchmark Large-Scale Challenge (OGB-LSC 2022) for the PCQM4Mv2 molecular property prediction task. Our approach implements several key principles from the prior literature. At its core our GPS++ method is a hybrid MPNN/Transformer model that incorporates 3D atom positions and an auxiliary denoising task. The effectiveness of GPS++ is demonstrated by achieving 0.0719 mean absolute error on the independent test-challenge PCQM4Mv2 split. Thanks to Graphcore IPU acceleration, GPS++ scales to deep architectures (16 layers), training at 3 minutes per epoch, and large ensemble (112 models), completing the final predictions in 1 hour 32 minutes, well under the 4 hour inference budget allocated. Our implementation is publicly available at: https://github.com/graphcore/ogb-lsc-pcqm4mv2.
Submitted 6 December, 2022; v1 submitted 18 November, 2022;
originally announced December 2022.
-
Efficiently Scaling Transformer Inference
Authors:
Reiner Pope,
Sholto Douglas,
Aakanksha Chowdhery,
Jacob Devlin,
James Bradbury,
Anselm Levskaya,
Jonathan Heek,
Kefan Xiao,
Shivani Agrawal,
Jeff Dean
Abstract:
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e. multiple query heads share a single key/value head) enable scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
Submitted 9 November, 2022;
originally announced November 2022.
-
Scaling Instruction-Finetuned Language Models
Authors:
Hyung Won Chung,
Le Hou,
Shayne Longpre,
Barret Zoph,
Yi Tay,
William Fedus,
Yunxuan Li,
Xuezhi Wang,
Mostafa Dehghani,
Siddhartha Brahma,
Albert Webson,
Shixiang Shane Gu,
Zhuyun Dai,
Mirac Suzgun,
Xinyun Chen,
Aakanksha Chowdhery,
Alex Castro-Ros,
Marie Pellat,
Kevin Robinson,
Dasha Valter,
Sharan Narang,
Gaurav Mishra,
Adams Yu,
Vincent Zhao,
Yanping Huang
, et al. (10 additional authors not shown)
Abstract:
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Submitted 6 December, 2022; v1 submitted 20 October, 2022;
originally announced October 2022.
-
A Review of Sparse Expert Models in Deep Learning
Authors:
William Fedus,
Jeff Dean,
Barret Zoph
Abstract:
Sparse expert models are a thirty-year old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
Submitted 4 September, 2022;
originally announced September 2022.
-
Computads for weak $ω$-categories as an inductive type
Authors:
Christopher J. Dean,
Eric Finster,
Ioannis Markakis,
David Reutter,
Jamie Vicary
Abstract:
We give a new description of computads for weak globular $ω$-categories by giving an explicit inductive definition of the free words. This yields a new understanding of computads, and allows a new definition of $ω$-category that avoids the technology of globular operads. Our framework permits direct proofs of important results via structural induction, and we use this to give new proofs that every $ω$-category is equivalent to a free one, and that the category of computads with generator-preserving maps is a presheaf topos, giving a direct description of the index category. We prove that our resulting definition of $ω$-category agrees with that of Batanin and Leinster and that the induced notion of cofibrant replacement for $ω$-categories coincides with that of Garner.
Submitted 20 March, 2024; v1 submitted 18 August, 2022;
originally announced August 2022.
-
High quantum efficiency parametric amplification via hybridized nonlinear optics
Authors:
Noah Flemens,
Dylan Heberle,
Jiaoyang Zheng,
Devin J. Dean,
Connor Davis,
Kevin Zawilski,
Peter G. Schunemann,
Jeffrey Moses
Abstract:
Parametric amplifiers have allowed breakthroughs in ultrafast, strong-field, and high-energy density laser science and are an essential tool for extending the frequency range of powerful emerging diode-pumped solid-state laser technology. However, their impact is limited by inherently low quantum efficiency due to nonuniform light extraction. Here we demonstrate a new type of parametric amplifier based on hybridized nonlinear optics. Hybridization of parametric amplification with idler second harmonic generation induces unusual evolution dynamics for a fully parametric amplifier - with saturating rather than cyclic gain - observed here for the first time. This allows highly uniform light extraction enabling unprecedented efficiency for a lossless amplifier with Gaussian-like intensity profiles - a 48-dB single-stage gain with 68% quantum efficiency and 44% pump-to-signal energy conversion - a several-fold improvement over the standard. Possessing both laser-like high quantum efficiency and the advantages of thermal-loading free parametric systems, this simple approach can be implemented widely and have significant impact by increasing several-fold the power available for science and industry.
Submitted 8 July, 2022;
originally announced July 2022.
-
Emergent Abilities of Large Language Models
Authors:
Jason Wei,
Yi Tay,
Rishi Bommasani,
Colin Raffel,
Barret Zoph,
Sebastian Borgeaud,
Dani Yogatama,
Maarten Bosma,
Denny Zhou,
Donald Metzler,
Ed H. Chi,
Tatsunori Hashimoto,
Oriol Vinyals,
Percy Liang,
Jeff Dean,
William Fedus
Abstract:
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
Submitted 26 October, 2022; v1 submitted 15 June, 2022;
originally announced June 2022.
-
An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems
Authors:
Andrea Gesmundo,
Jeff Dean
Abstract:
Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, which adds the temporal aspect to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting instead of being studied at a large scale as a critical component to build the next generation of artificial intelligence. We propose an evolutionary method capable of generating large-scale multitask models that support the dynamic addition of new tasks. The generated multitask models are sparsely activated and integrate a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We demonstrate empirically that the proposed method can jointly solve and achieve competitive results on 69 public image classification tasks, for example improving the state of the art on a competitive benchmark such as cifar10 by achieving a 15% relative error reduction compared to the best model trained on public data.
Submitted 15 November, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems
Authors:
Andrea Gesmundo,
Jeff Dean
Abstract:
Most uses of machine learning today involve training a model from scratch for a particular task, or sometimes starting with a model pretrained on a related task and then fine-tuning on a downstream task. Both approaches offer limited knowledge transfer between different tasks, time-consuming human-driven customization to individual tasks and high computational costs, especially when starting from randomly initialized models. We propose a method that uses the layers of a pretrained deep neural network as building blocks to construct an ML system that can jointly solve an arbitrary number of tasks. The resulting system can leverage cross-task knowledge transfer, while being immune from common drawbacks of multitask approaches such as catastrophic forgetting, gradient interference and negative transfer. We define an evolutionary approach designed to jointly select the prior knowledge relevant for each task, choose the subset of the model parameters to train and dynamically auto-tune its hyperparameters. Furthermore, a novel scale control method is employed to achieve quality/size trade-offs that outperform common fine-tuning techniques. Compared with standard fine-tuning on a benchmark of 10 diverse image classification tasks, the proposed model improves the average accuracy by 2.39% while using 47% fewer parameters per task.
Submitted 25 May, 2022; v1 submitted 22 May, 2022;
originally announced May 2022.
-
The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink
Authors:
David Patterson,
Joseph Gonzalez,
Urs Hölzle,
Quoc Le,
Chen Liang,
Lluis-Miquel Munguia,
Daniel Rothchild,
David So,
Maud Texier,
Jeff Dean
Abstract:
Machine Learning (ML) workloads have rapidly grown in importance, but raised concerns about their carbon footprint. Four best practices can reduce ML training energy by up to 100x and CO2 emissions by up to 1000x. By following best practices, overall ML energy use (across research, development, and production) held steady at <15% of Google's total energy use for the past three years. If the whole ML field were to adopt best practices, total carbon emissions from training would decrease. Hence, we recommend that ML papers include emissions explicitly to foster competition on more than just model quality. Estimates of emissions in papers that omitted them have been off by 100x-100,000x, so publishing emissions has the added benefit of ensuring accurate accounting. Given the importance of climate change, we must get the numbers right to make certain that we work on its biggest challenges.
Submitted 11 April, 2022;
originally announced April 2022.
-
PaLM: Scaling Language Modeling with Pathways
Authors:
Aakanksha Chowdhery,
Sharan Narang,
Jacob Devlin,
Maarten Bosma,
Gaurav Mishra,
Adam Roberts,
Paul Barham,
Hyung Won Chung,
Charles Sutton,
Sebastian Gehrmann,
Parker Schuh,
Kensen Shi,
Sasha Tsvyashchenko,
Joshua Maynez,
Abhishek Rao,
Parker Barnes,
Yi Tay,
Noam Shazeer,
Vinodkumar Prabhakaran,
Emily Reif,
Nan Du,
Ben Hutchinson,
Reiner Pope,
James Bradbury,
Jacob Austin
, et al. (42 additional authors not shown)
Abstract:
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
Submitted 5 October, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Pathways: Asynchronous Distributed Dataflow for ML
Authors:
Paul Barham,
Aakanksha Chowdhery,
Jeff Dean,
Sanjay Ghemawat,
Steven Hand,
Dan Hurt,
Michael Isard,
Hyeontaek Lim,
Ruoming Pang,
Sudip Roy,
Brennan Saeta,
Parker Schuh,
Ryan Sepassi,
Laurent El Shafey,
Chandramohan A. Thekkath,
Yonghui Wu
Abstract:
We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
Submitted 23 March, 2022;
originally announced March 2022.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Authors:
Barret Zoph,
Irwan Bello,
Sameer Kumar,
Nan Du,
Yanping Huang,
Jeff Dean,
Noam Shazeer,
William Fedus
Abstract:
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
Submitted 29 April, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
Nuclear Forces for Precision Nuclear Physics -- a collection of perspectives
Authors:
Ingo Tews,
Zohreh Davoudi,
Andreas Ekström,
Jason D. Holt,
Kevin Becker,
Raúl Briceño,
David J. Dean,
William Detmold,
Christian Drischler,
Thomas Duguet,
Evgeny Epelbaum,
Ashot Gasparyan,
Jambul Gegelia,
Jeremy R. Green,
Harald W. Grießhammer,
Andrew D. Hanlon,
Matthias Heinz,
Heiko Hergert,
Martin Hoferichter,
Marc Illa,
David Kekejian,
Alejandro Kievsky,
Sebastian König,
Hermann Krebs,
Kristina D. Launey
, et al. (20 additional authors not shown)
Abstract:
This is a collection of perspective pieces contributed by the participants of the Institute of Nuclear Theory's Program on Nuclear Physics for Precision Nuclear Physics which was held virtually from April 19 to May 7, 2021. The collection represents the reflections of a vibrant and engaged community of researchers on the status of theoretical research in low-energy nuclear physics, the challenges ahead, and new ideas and strategies to make progress in nuclear structure and reaction physics, effective field theory, lattice QCD, quantum information, and quantum computing. The contributed pieces solely reflect the perspectives of the respective authors and do not represent the viewpoints of the Institute for Nuclear Theory or the organizers of the program.
Submitted 2 February, 2022;
originally announced February 2022.
-
Interpretable Machine Learning for Materials Design
Authors:
James Dean,
Matthias Scheffler,
Thomas A. R. Purcell,
Sergey V. Barabash,
Rahul Bhowmik,
Timur Bazhirov
Abstract:
Fueled by the widespread adoption of Machine Learning (ML) and the high-throughput screening of materials, the data-centric approach to materials design has asserted itself as a robust and powerful tool for the in-silico prediction of materials properties. When training models to predict material properties, researchers often face a difficult choice between a model's interpretability or its performance. We study this trade-off by leveraging four different state-of-the-art ML techniques: XGBoost, SISSO, Roost, and TPOT for the prediction of structural and electronic properties of perovskites and 2D materials. We then assess the future outlook of the continued integration of ML into materials discovery and identify key problems that will continue to challenge researchers as the size of the literature's datasets and complexity of models increases. Finally, we offer several possible solutions to these challenges with a focus on retaining interpretability and share our thoughts on magnifying the impact of ML on materials design.
Submitted 30 November, 2021;
originally announced December 2021.
-
Femtojoule, femtosecond all-optical switching in lithium niobate nanophotonics
Authors:
Qiushi Guo,
Ryoto Sekine,
Luis Ledezma,
Rajveer Nehra,
Devin J. Dean,
Arkadev Roy,
Robert M. Gray,
Saman Jahani,
Alireza Marandi
Abstract:
Optical nonlinear functions are crucial for various applications in integrated photonics, such as all-optical information processing, photonic neural networks and on-chip ultrafast light sources. Due to the weak nonlinearities in most integrated photonic platforms, realizing optical nonlinear functions typically requires large driving energies in the picojoules level or beyond, thus imposing a barrier for most applications. Here, we tackle this challenge and demonstrate an integrated nonlinear splitter device in lithium niobate nano-waveguides by simultaneous engineering of the dispersion and quasi-phase matching. We achieve non-resonant all-optical switching with ultra-low energies down to tens of femtojoules, a near instantaneous switching time of 18 fs, and a large extinction ratio of more than 5 dB. Our nonlinear splitter simultaneously realizes switch-on and -off operations and features a state-of-the-art switching energy-time product as low as $1.4 \times10^{-27}$ J$\cdot$s. We also show a path toward attojoule level all-optical switching by further optimizing the device geometry. Our results can enable on-chip ultrafast and energy-efficient all-optical information processing, computing systems, and light sources.
Submitted 21 July, 2021;
originally announced July 2021.
-
Carbon Emissions and Large Neural Network Training
Authors:
David Patterson,
Joseph Gonzalez,
Quoc Le,
Chen Liang,
Lluis-Miquel Munguia,
Daniel Rothchild,
David So,
Maud Texier,
Jeff Dean
Abstract:
The computation demand for machine learning (ML) has grown rapidly recently, which comes with a number of costs. Estimating the energy cost helps measure its environmental impact and find greener strategies, yet it is challenging without detailed information. We calculate the energy use and carbon footprint of several recent large models -- T5, Meena, GShard, Switch Transformer, and GPT-3 -- and refine earlier estimates for the neural architecture search that found Evolved Transformer. We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e): Large but sparsely activated DNNs can consume <1/10th the energy of large, dense DNNs without sacrificing accuracy despite using as many or even more parameters. Geographic location matters for ML workload scheduling since the fraction of carbon-free energy and resulting CO2e vary ~5X-10X, even within the same country and the same organization. We are now optimizing where and when large models are trained. Specific datacenter infrastructure matters, as Cloud datacenters can be ~1.4-2X more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be ~2-5X more effective than off-the-shelf systems. Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint up to ~100-1000X. These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.
Submitted 23 April, 2021; v1 submitted 21 April, 2021;
originally announced April 2021.
-
Overscreening and Underscreening in Solid-Electrolyte Grain Boundary Space-Charge Layers
Authors:
Jacob M. Dean,
Samuel W. Coles,
William R. Saunders,
Andrew R. McCluskey,
Matthew J. Wolf,
Alison B. Walker,
Benjamin J. Morgan
Abstract:
Polycrystalline solids can exhibit material properties that differ significantly from those of equivalent single-crystal samples, in part, because of a spontaneous redistribution of mobile point defects into so-called space-charge regions adjacent to grain boundaries. The general analytical form of these space-charge regions is known only in the dilute limit, where defect-defect correlations can be neglected. Using kinetic Monte Carlo simulations of a three-dimensional Coulomb lattice gas, we show that grain-boundary space-charge regions in non-dilute solid electrolytes exhibit overscreening -- damped oscillatory space-charge profiles -- and underscreening -- decay lengths that are longer than the corresponding Debye length and that increase with increasing defect-defect interaction strength. Overscreening and underscreening are known phenomena in concentrated liquid electrolytes, and the observation of functionally analogous behaviour in solid electrolyte space-charge regions suggests that the same underlying physics drives behaviour in both classes of systems. We therefore expect theoretical approaches developed to study non-dilute liquid electrolytes to be equally applicable to future studies of solid electrolytes.
Submitted 1 April, 2021;
originally announced April 2021.
-
Intermediate Layer Optimization for Inverse Problems using Deep Generative Models
Authors:
Giannis Daras,
Joseph Dean,
Ajil Jalal,
Alexandros G. Dimakis
Abstract:
We propose Intermediate Layer Optimization (ILO), a novel optimization algorithm for solving inverse problems with deep generative models. Instead of optimizing only over the initial latent code, we progressively change the input layer obtaining successively more expressive generators. To explore the higher dimensional spaces, our method searches for latent codes that lie within a small $l_1$ ball around the manifold induced by the previous layer. Our theoretical analysis shows that by keeping the radius of the ball relatively small, we can improve the established error bound for compressed sensing with deep generative models. We empirically show that our approach outperforms state-of-the-art methods introduced in StyleGAN-2 and PULSE for a wide range of inverse problems including inpainting, denoising, super-resolution and compressed sensing.
Submitted 15 February, 2021;
originally announced February 2021.
-
Interlocking Backpropagation: Improving depthwise model-parallelism
Authors:
Aidan N. Gomez,
Oscar Key,
Kuba Perlin,
Stephen Gou,
Nick Frosst,
Jeff Dean,
Yarin Gal
Abstract:
The number of parameters in state of the art neural networks has drastically increased in recent years. This surge of interest in large scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism can suffer from poor resource utilisation, which leads to wasted resources. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation in the global setting and poor task performance in the local setting, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation, while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image classification ResNets and Transformer language models, finding that our strategy consistently out-performs local learning in terms of task performance, and out-performs global learning in training efficiency.
Submitted 7 July, 2022; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Phase transitions on non-uniformly curved surfaces: Coupling between phase and location
Authors:
Jack O. Law,
Jacob M. Dean,
Mark A. Miller,
Halim Kusumaatmaja
Abstract:
For particles confined to two dimensions, any curvature of the surface affects the structural, kinetic and thermodynamic properties of the system. If the curvature is non-uniform, an even richer range of behaviours can emerge. Using a combination of bespoke Monte Carlo, molecular dynamics and basin-hopping methods, we show that the stable states of attractive colloids confined to non-uniformly curved surfaces are distinguished not only by the phase of matter but also by their location on the surface. Consequently, the transitions between these states involve cooperative migration of the entire colloidal assembly. We demonstrate these phenomena on toroidal and sinusoidal surfaces for model colloids with different ranges of interactions as described by the Morse potential. In all cases, the behaviour can be rationalised in terms of three universal considerations: cluster perimeter, stress, and the packing of next-nearest neighbours.
Submitted 6 August, 2020;
originally announced August 2020.
-
From climate change to pandemics: decision science can help scientists have impact
Authors:
Christopher M. Baker,
Patricia T. Campbell,
Iadine Chades,
Angela J. Dean,
Susan M. Hester,
Matthew H. Holden,
James M. McCaw,
Jodie McVernon,
Robert Moss,
Freya M. Shearer,
Hugh P. Possingham
Abstract:
Scientific knowledge and advances are a cornerstone of modern society. They improve our understanding of the world we live in and help us navigate global challenges including emerging infectious diseases, climate change and the biodiversity crisis. For any scientist, whether they work primarily in fundamental knowledge generation or in the applied sciences, it is important to understand how science fits into a decision-making framework. Decision science is a field that aims to pinpoint evidence-based management strategies. It provides a framework for scientists to directly impact decisions or to understand how their work will fit into a decision process. Decision science is more than undertaking targeted and relevant scientific research or providing tools to assist policy makers; it is an approach to problem formulation, bringing together mathematical modelling, stakeholder values and logistical constraints to support decision making. In this paper we describe decision science, its use in different contexts, and highlight current gaps in methodology and application. The COVID-19 pandemic has thrust mathematical models into the public spotlight, but it is one of innumerable examples in which modelling informs decision making. Other examples include models of storm systems (e.g., cyclones, hurricanes) and climate change. Although the decision timescale in these examples differs enormously (from hours to decades), the underlying decision science approach is common across all problems. Bridging communication gaps between different groups is one of the greatest challenges for scientists. However, by better understanding and engaging with the decision-making processes, scientists will have greater impact and make stronger contributions to important societal problems.
Submitted 21 October, 2021; v1 submitted 26 July, 2020;
originally announced July 2020.
-
Globular Multicategories with Homomorphism Types
Authors:
Christopher J. Dean
Abstract:
We introduce a notion of globular multicategory with homomorphism types. These structures arise when organizing collections of "higher category-like" objects such as type theories with identity types. We show how these globular multicategories can be used to construct various weak higher categorical structures of types and terms.
Submitted 28 May, 2020;
originally announced May 2020.
-
Chip Placement with Deep Reinforcement Learning
Authors:
Azalia Mirhoseini,
Anna Goldie,
Mustafa Yazgan,
Joe Jiang,
Ebrahim Songhori,
Shen Wang,
Young-Joon Lee,
Eric Johnson,
Omkar Pathak,
Sungmin Bae,
Azade Nazi,
Jiwoo Pak,
Andy Tong,
Kavya Srinivasa,
William Hang,
Emre Tuncer,
Anand Babu,
Quoc V. Le,
James Laudon,
Richard Ho,
Roger Carpenter,
Jeff Dean
Abstract:
In this work, we present a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. To achieve these results, we pose placement as a Reinforcement Learning (RL) problem and train an agent to place the nodes of a chip netlist onto a chip canvas. To enable our RL policy to generalize to unseen blocks, we ground representation learning in the supervised task of predicting placement quality. By designing a neural architecture that can accurately predict reward across a wide variety of netlists and their placements, we are able to generate rich feature embeddings of the input netlists. We then use this architecture as the encoder of our policy and value networks to enable transfer learning. Our objective is to minimize PPA (power, performance, and area), and we show that, in under 6 hours, our method can generate placements that are superhuman or comparable on modern accelerator netlists, whereas existing baselines require human experts in the loop and take several weeks.
Submitted 22 April, 2020;
originally announced April 2020.
-
The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 Challenge
Authors:
Nicholas Heller,
Fabian Isensee,
Klaus H. Maier-Hein,
Xiaoshuai Hou,
Chunmei Xie,
Fengyi Li,
Yang Nan,
Guangrui Mu,
Zhiyong Lin,
Miofei Han,
Guang Yao,
Yaozong Gao,
Yao Zhang,
Yixin Wang,
Feng Hou,
Jiawei Yang,
Guangwei Xiong,
Jiang Tian,
Cheng Zhong,
Jun Ma,
Jack Rickman,
Joshua Dean,
Bethany Stai,
Resha Tejpaul,
Makinna Oestreich
, et al. (16 additional authors not shown)
Abstract:
There is a large body of literature linking anatomic and geometric characteristics of kidney tumors to perioperative and oncologic outcomes. Semantic segmentation of these tumors and their host kidneys is a promising tool for quantitatively characterizing these lesions, but its adoption is limited due to the manual effort required to produce high-quality 3D segmentations of these structures. Recently, methods based on deep learning have shown excellent results in automatic 3D segmentation, but they require large datasets for training, and there remains little consensus on which methods perform best. The 2019 Kidney and Kidney Tumor Segmentation challenge (KiTS19) was a competition held in conjunction with the 2019 International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) which sought to address these issues and stimulate progress on this automatic segmentation problem. A training set of 210 cross sectional CT images with kidney tumors was publicly released with corresponding semantic segmentation masks. 106 teams from five continents used this data to develop automated systems to predict the true segmentation masks on a test set of 90 CT images for which the corresponding ground truth segmentations were kept private. These predictions were scored and ranked according to their average Sørensen-Dice coefficient between the kidney and tumor across all 90 cases. The winning team achieved a Dice of 0.974 for kidney and 0.851 for tumor, approaching the inter-annotator performance on kidney (0.983) but falling short on tumor (0.923). This challenge has now entered an "open leaderboard" phase where it serves as a challenging benchmark in 3D semantic segmentation.
Submitted 7 August, 2020; v1 submitted 2 December, 2019;
originally announced December 2019.
-
The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
Authors:
Jeffrey Dean
Abstract:
The past decade has seen a remarkable series of advances in machine learning, and in particular deep learning approaches based on artificial neural networks, to improve our abilities to build more accurate systems across a broad range of areas, including computer vision, speech recognition, language translation, and natural language understanding tasks. This paper is a companion paper to a keynote talk at the 2020 International Solid-State Circuits Conference (ISSCC) discussing some of the advances in machine learning, and their implications on the kinds of computational devices we need to build, especially in the post-Moore's Law-era. It also discusses some of the ways that machine learning may also be able to help with some aspects of the circuit design process. Finally, it provides a sketch of at least one interesting direction towards much larger-scale multi-task models that are sparsely activated and employ much more dynamic, example- and task-based routing than the machine learning models of today.
Submitted 12 November, 2019;
originally announced November 2019.
-
Accelerating Deep Learning by Focusing on the Biggest Losers
Authors:
Angela H. Jiang,
Daniel L. -K. Wong,
Giulio Zhou,
David G. Andersen,
Jeffrey Dean,
Gregory R. Ganger,
Gauri Joshi,
Michael Kaminsky,
Michael Kozuch,
Zachary C. Lipton,
Padmanabhan Pillai
Abstract:
This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of computationally-expensive backpropagation steps performed, Selective-Backprop accelerates training. Evaluation on CIFAR10, CIFAR100, and SVHN, across a variety of modern image models, shows that Selective-Backprop converges to target error rates up to 3.5x faster than with standard SGD and between 1.02--1.8x faster than a state-of-the-art importance sampling approach. Further acceleration of 26% can be achieved by using stale forward pass results for selection, thus also skipping forward passes of low priority examples.
Submitted 1 October, 2019;
originally announced October 2019.
-
MLSys: The New Frontier of Machine Learning Systems
Authors:
Alexander Ratner,
Dan Alistarh,
Gustavo Alonso,
David G. Andersen,
Peter Bailis,
Sarah Bird,
Nicholas Carlini,
Bryan Catanzaro,
Jennifer Chayes,
Eric Chung,
Bill Dally,
Jeff Dean,
Inderjit S. Dhillon,
Alexandros Dimakis,
Pradeep Dubey,
Charles Elkan,
Grigori Fursin,
Gregory R. Ganger,
Lise Getoor,
Phillip B. Gibbons,
Garth A. Gibson,
Joseph E. Gonzalez,
Justin Gottschlich,
Song Han,
Kim Hazelwood
, et al. (44 additional authors not shown)
Abstract:
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
Submitted 1 December, 2019; v1 submitted 29 March, 2019;
originally announced April 2019.
-
The KiTS19 Challenge Data: 300 Kidney Tumor Cases with Clinical Context, CT Semantic Segmentations, and Surgical Outcomes
Authors:
Nicholas Heller,
Niranjan Sathianathen,
Arveen Kalapara,
Edward Walczak,
Keenan Moore,
Heather Kaluzniak,
Joel Rosenberg,
Paul Blake,
Zachary Rengel,
Makinna Oestreich,
Joshua Dean,
Michael Tradewell,
Aneri Shah,
Resha Tejpaul,
Zachary Edgerton,
Matthew Peterson,
Shaneabbas Raza,
Subodh Regmi,
Nikolaos Papanikolopoulos,
Christopher Weight
Abstract:
The morphometry of a kidney tumor revealed by contrast-enhanced Computed Tomography (CT) imaging is an important factor in clinical decision making surrounding the lesion's diagnosis and treatment. Quantitative study of the relationship between kidney tumor morphology and clinical outcomes is difficult due to data scarcity and the laborious nature of manually quantifying imaging predictors. Automatic semantic segmentation of kidneys and kidney tumors is a promising tool towards automatically quantifying a wide array of morphometric features, but no sizeable annotated dataset is currently available to train models for this task. We present the KiTS19 challenge dataset: A collection of multi-phase CT imaging, segmentation masks, and comprehensive clinical outcomes for 300 patients who underwent nephrectomy for kidney tumors at our center between 2010 and 2018. 210 (70%) of these patients were selected at random as the training set for the 2019 MICCAI KiTS Kidney Tumor Segmentation Challenge and have been released publicly. With the presence of clinical context and surgical outcomes, this data can serve not only for benchmarking semantic segmentation models, but also for developing and studying biomarkers which make use of the imaging and semantic segmentation masks.
Submitted 15 March, 2020; v1 submitted 31 March, 2019;
originally announced April 2019.
-
The Apache Point Observatory Galactic Evolution Experiment (APOGEE) Spectrographs
Authors:
J. C. Wilson,
F. R. Hearty,
M. F. Skrutskie,
S. R. Majewski,
J. A. Holtzman,
D. Eisenstein,
J. Gunn,
B. Blank,
C. Henderson,
S. Smee,
M. Nelson,
D. Nidever,
J. Arns,
R. Barkhouser,
J. Barr,
S. Beland,
M. A. Bershady,
M. R. Blanton,
S. Brunner,
A. Burton,
L. Carey,
M. Carr,
J. P. Colque,
J. Crane,
G. J. Damke
, et al. (64 additional authors not shown)
Abstract:
We describe the design and performance of the near-infrared (1.51--1.70 micron), fiber-fed, multi-object (300 fibers), high resolution (R = lambda/delta lambda ~ 22,500) spectrograph built for the Apache Point Observatory Galactic Evolution Experiment (APOGEE). APOGEE is a survey of ~ 10^5 red giant stars that systematically sampled all Milky Way populations (bulge, disk, and halo) to study the Galaxy's chemical and kinematical history. It was part of the Sloan Digital Sky Survey III (SDSS-III) from 2011 -- 2014 using the 2.5 m Sloan Foundation Telescope at Apache Point Observatory, New Mexico. The APOGEE-2 survey is now using the spectrograph as part of SDSS-IV, as well as a second spectrograph, a close copy of the first, operating at the 2.5 m du Pont Telescope at Las Campanas Observatory in Chile. Although several fiber-fed, multi-object, high resolution spectrographs have been built for visual wavelength spectroscopy, the APOGEE spectrograph is one of the first such instruments built for observations in the near-infrared. The instrument's successful development was enabled by several key innovations, including a "gang connector" to allow simultaneous connections of 300 fibers; hermetically sealed feedthroughs to allow fibers to pass through the cryostat wall continuously; the first cryogenically deployed mosaic volume phase holographic grating; and a large refractive camera that includes mono-crystalline silicon and fused silica elements with diameters as large as ~ 400 mm. This paper contains a comprehensive description of all aspects of the instrument including the fiber system, optics and opto-mechanics, detector arrays, mechanics and cryogenics, instrument control, calibration system, optical performance and stability, lessons learned, and design changes for the second instrument.
Submitted 3 February, 2019;
originally announced February 2019.
-
Microscope 2.0: An Augmented Reality Microscope with Real-time Artificial Intelligence Integration
Authors:
Po-Hsuan Cameron Chen,
Krishna Gadepalli,
Robert MacDonald,
Yun Liu,
Kunal Nagpal,
Timo Kohlberger,
Jeffrey Dean,
Greg S. Corrado,
Jason D. Hipp,
Martin C. Stumpe
Abstract:
The brightfield microscope is instrumental in the visual examination of both biological and physical samples at sub-millimeter scales. One key clinical application has been in cancer histopathology, where the microscopic assessment of the tissue samples is used for the diagnosis and staging of cancer and thus guides clinical therapy. However, the interpretation of these samples is inherently subjective, resulting in significant diagnostic variability. Moreover, in many regions of the world, access to pathologists is severely limited due to lack of trained personnel. In this regard, Artificial Intelligence (AI) based tools promise to improve the access and quality of healthcare. However, despite significant advances in AI research, integration of these tools into real-world cancer diagnosis workflows remains challenging because of the costs of image digitization and difficulties in deploying AI solutions. Here we propose a cost-effective solution to the integration of AI: the Augmented Reality Microscope (ARM). The ARM overlays AI-based information onto the current view of the sample through the optical pathway in real-time, enabling seamless integration of AI into the regular microscopy workflow. We demonstrate the utility of ARM in the detection of lymph node metastases in breast cancer and the identification of prostate cancer with a latency that supports real-time workflows. We anticipate that ARM will remove barriers towards the use of AI in microscopic analysis and thus improve the accuracy and efficiency of cancer diagnosis. This approach is applicable to other microscopy tasks and AI algorithms in the life sciences and beyond.
Submitted 4 December, 2018; v1 submitted 21 November, 2018;
originally announced December 2018.