Showing 1–6 of 6 results for author: Jaszczur, S

Searching in archive cs.
  1. arXiv:2402.07871  [pdf, other]

    cs.LG cs.AI cs.CL

    Scaling Laws for Fine-Grained Mixture of Experts

    Authors: Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

    Abstract: Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling la…

    Submitted 12 February, 2024; originally announced February 2024.

  2. arXiv:2401.04081  [pdf, other]

    cs.LG cs.AI cs.CL

    MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

    Authors: Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur

    Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcas…

    Submitted 26 February, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

  3. arXiv:2312.17296  [pdf, other]

    cs.CL

    Structured Packing in LLM Training Improves Long Context Utilization

    Authors: Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, Yu Zhao, Henryk Michalewski, Łukasz Kuciński, Piotr Miłoś

    Abstract: Recent advancements in long-context large language models have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. This study investigates structuring training data to enhance semantic interdependence, demonstrating that this approach effectively improves context utilization. To this end, we introduce the Structured Packing for Long C…

    Submitted 24 June, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: new experiments with a 13B model

  4. arXiv:2310.15961  [pdf, other]

    cs.CL cs.LG

    Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation

    Authors: Szymon Antoniak, Sebastian Jaszczur, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Tomasz Odrzygóźdź, Marek Cygan

    Abstract: Despite the promise of Mixture of Experts (MoE) models in increasing parameter counts of Transformer models while maintaining training and inference costs, their application carries notable drawbacks. The key strategy of these models is to, for each processed token, activate at most a few experts - subsets of an extensive feed-forward layer. But this approach is not without its challenges. The ope…

    Submitted 24 October, 2023; originally announced October 2023.

  5. arXiv:2111.12763  [pdf, other]

    cs.LG cs.CL

    Sparse is Enough in Scaling Transformers

    Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

    Abstract: Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to sca…

    Submitted 24 November, 2021; originally announced November 2021.

    Comments: NeurIPS 2021

  6. arXiv:2005.13406  [pdf, other]

    cs.AI

    Neural heuristics for SAT solving

    Authors: Sebastian Jaszczur, Michał Łuszczyk, Henryk Michalewski

    Abstract: We use neural graph networks with a message-passing architecture and an attention mechanism to enhance the branching heuristic in two SAT-solving algorithms. We report improvements of learned neural heuristics compared with two standard human-designed heuristics.

    Submitted 27 May, 2020; originally announced May 2020.