Showing 1–10 of 10 results for author: Osawa, K

  1. arXiv:2403.05530

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1110 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 8 August, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  2. arXiv:2305.04684

    cs.LG

    ASDL: A Unified Interface for Gradient Preconditioning in PyTorch

    Authors: Kazuki Osawa, Satoki Ishikawa, Rio Yokota, Shigang Li, Torsten Hoefler

    Abstract: Gradient preconditioning is a key technique to integrate the second-order information into gradients for improving and extending gradient-based learning algorithms. In deep learning, stochasticity, nonconvexity, and high dimensionality lead to a wide variety of gradient preconditioning methods, with implementation complexity and inconsistent performance and feasibility. We propose the Automatic Se…

    Submitted 8 May, 2023; originally announced May 2023.

  3. arXiv:2211.14133

    cs.LG

    PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices

    Authors: Kazuki Osawa, Shigang Li, Torsten Hoefler

    Abstract: Pipeline parallelism enables efficient training of Large Language Models (LLMs) on large-scale distributed accelerator clusters. Yet, pipeline bubbles during startup and tear-down reduce the utilization of accelerators. Although efficient pipeline schemes with micro-batching and bidirectional pipelines have been proposed to maximize utilization, a significant number of bubbles cannot be filled usi…

    Submitted 13 May, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

    Comments: MLSys 2023

  4. arXiv:2210.02720

    cs.LG stat.ML

    Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias

    Authors: Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa

    Abstract: Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve the performance. In this study, we first reveal that a specific finite-difference…

    Submitted 2 February, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

  5. arXiv:2209.09732

    cs.LG cs.DB

    Neural Graph Databases

    Authors: Maciej Besta, Patrick Iff, Florian Scheidl, Kazuki Osawa, Nikoli Dryden, Michal Podstawski, Tiancheng Chen, Torsten Hoefler

    Abstract: Graph databases (GDBs) enable processing and analysis of unstructured, complex, rich, and usually vast graph datasets. Despite the large significance of GDBs in both academia and industry, little effort has been made into integrating them with the predictive power of graph neural networks (GNNs). In this work, we show how to seamlessly combine nearly any GNN model with the computational capabiliti…

    Submitted 24 November, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Journal ref: Learning on Graphs (LOG) 2022

  6. Efficient Quantized Sparse Matrix Operations on Tensor Cores

    Authors: Shigang Li, Kazuki Osawa, Torsten Hoefler

    Abstract: The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedu…

    Submitted 7 May, 2023; v1 submitted 14 September, 2022; originally announced September 2022.

    Comments: Published in Proceedings of 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'22), No.: 37, Pages 1-15, Best Paper Finalist, https://dl.acm.org/doi/10.5555/3571885.3571934 (In this arXiv version, we fix a typo at the bottom right of Page 6: For SDDMM, each thread block needs $\textbf{K/BS}$$_k$ steps to obtain the final results; we fix Table 3.)

    ACM Class: C.1.4; I.2.11

  7. arXiv:2010.00879

    stat.ML cond-mat.dis-nn cs.LG

    Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks

    Authors: Ryo Karakida, Kazuki Osawa

    Abstract: Natural Gradient Descent (NGD) helps to accelerate the convergence of gradient descent dynamics, but it requires approximations in large-scale deep neural networks because of its high computational cost. Empirical studies have confirmed that some NGD methods with approximate Fisher information converge sufficiently fast in practice. Nevertheless, it remains unclear from the theoretical perspective…

    Submitted 7 December, 2020; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: NeurIPS 2020

  8. arXiv:2002.06015

    cs.LG stat.ML

    Scalable and Practical Natural Gradient for Large-Scale Deep Learning

    Authors: Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Chuan-Sheng Foo, Rio Yokota

    Abstract: Large-scale distributed training of deep neural networks results in models with worse generalization performance as a result of the increase in the effective mini-batch size. Previous approaches attempt to address this problem by varying the learning rate and batch size over epochs and layers, or ad hoc modifications of batch normalization. We propose Scalable and Practical Natural Gradient Descen…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: arXiv admin note: text overlap with arXiv:1811.12019

  9. arXiv:1906.02506

    stat.ML cs.LG

    Practical Deep Learning with Bayesian Principles

    Authors: Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota, Mohammad Emtiyaz Khan

    Abstract: Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar pe…

    Submitted 29 October, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: NeurIPS 2019

  10. arXiv:1811.12019

    cs.LG cs.CV stat.ML

    Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

    Authors: Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, Satoshi Matsuoka

    Abstract: Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or some ad hoc modification of the batch normalization. We propose an alternative approach using a second-order optimization method that…

    Submitted 30 March, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

    Comments: 10 pages, 7 figures. Accepted at CVPR 2019, Long Beach, CA