Showing 1–2 of 2 results for author: Shota, H

Searching in archive cs.
  1. arXiv:2404.17790  [pdf, other]

    cs.CL cs.AI

    Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

    Authors: Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki

    Abstract: Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a…

    Submitted 27 April, 2024; originally announced April 2024.

  2. arXiv:2404.17733  [pdf, other]

    cs.CL cs.AI

    Building a Large Japanese Web Corpus for Large Language Models

    Authors: Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki

    Abstract: Open Japanese large language models (LLMs) have been trained on the Japanese portions of corpora such as CC-100, mC4, and OSCAR. However, these corpora were not created for the quality of Japanese texts. This study builds a large Japanese web corpus by extracting and refining text from the Common Crawl archive (21 snapshots of approximately 63.4 billion pages crawled between 2020 and 2023). This c…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: 17 pages