Showing 1–3 of 3 results for author: Samin, K

  1. arXiv:2106.13822

    cs.CL

    XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

    Authors: Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, Rifat Shahriyar

    Abstract: Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. Th…

    Submitted 25 June, 2021; originally announced June 2021.

    Comments: Findings of the Association for Computational Linguistics, ACL 2021 (camera-ready)

  2. arXiv:2101.00204

    cs.CL

    BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

    Authors: Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Kazi Samin, Md Saiful Islam, Anindya Iqbal, M. Sohel Rahman, Rifat Shahriyar

    Abstract: In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answ…

    Submitted 10 May, 2022; v1 submitted 1 January, 2021; originally announced January 2021.

    Comments: Findings of North American Chapter of the Association for Computational Linguistics, NAACL 2022 (camera-ready)

  3. arXiv:2009.09359

    cs.CL

    Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation

    Authors: Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M. Sohel Rahman, Rifat Shahriyar

    Abstract: Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources. Most publicly available parallel corpora for Bengali are not large enough; and have rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of a hig…

    Submitted 7 October, 2020; v1 submitted 20 September, 2020; originally announced September 2020.

    Comments: EMNLP 2020