Showing 1–1 of 1 results for author: Nahata, R

  1. arXiv:2407.12481 [pdf, other]

    cs.CL

    Pretraining Data and Tokenizer for Indic LLM

    Authors: Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar

    Abstract: We present a novel approach to data preparation for developing a multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redunda…

    Submitted 17 July, 2024; originally announced July 2024.