Showing 1–5 of 5 results for author: Glorioso, P

Searching in archive cs.
  1. arXiv:2406.01981

    cs.CL cs.AI

    Zyda: A 1.3T Dataset for Open Language Modeling

    Authors: Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, Adam Ibrahim, James Whittington, Quentin Anthony

    Abstract: The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In t…

    Submitted 4 June, 2024; originally announced June 2024.

  2. arXiv:2405.16712

    cs.LG cs.AI cs.CL

    Zamba: A Compact 7B SSM Hybrid Model

    Authors: Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge

    Abstract: In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, th…

    Submitted 26 May, 2024; originally announced May 2024.

  3. arXiv:2403.17887

    cs.CL cs.LG stat.ML

    The Unreasonable Ineffectiveness of the Deeper Layers

    Authors: Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts

    Abstract: We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: 12 + 10 pages, 5 + 4 figures

    Report number: MIT-CTP/5694

  4. arXiv:2402.01771

    cs.CL cs.AI cs.DC cs.LG

    BlackMamba: Mixture of Experts for State-Space Models

    Authors: Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge

    Abstract: State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM model, shows impressive performance in both language modeling and long sequence processing tasks. Simultaneously, mixture-of-expert (MoE) models hav…

    Submitted 1 February, 2024; originally announced February 2024.

  5. arXiv:2210.16400

    cs.LG cond-mat.dis-nn cond-mat.stat-mech

    Flatter, faster: scaling momentum for optimal speedup of SGD

    Authors: Aditya Cowsik, Tankut Can, Paolo Glorioso

    Abstract: Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we…

    Submitted 13 June, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: v2: expanded introduction section, corrected minor typos. v1: 12+13 pages, 3 figures