OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining

Liu, Yihong; Lin, Peiqin; Wang, Mingyang; Schütze, Hinrich

arXiv:2311.08849 (cs)

[Submitted on 15 Nov 2023 (v1), last revised 25 Mar 2024 (this version, v2)]

Abstract: Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually initializes the embeddings of new subwords randomly and introduces substantially more embedding parameters to the model, which reduces efficiency. To address these issues, we propose a novel framework, One For All (OFA), which initializes the embeddings of unseen subwords in an informed way and can thus adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the large embedding matrix with two lower-dimensional matrices, which substantially reduces the number of parameters. We show that OFA accelerates the convergence of continued pretraining, which is environmentally friendly because a much smaller carbon footprint is generated. Through extensive experiments, we demonstrate that OFA achieves competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
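
The two ideas in the abstract, factorizing the embedding matrix and initializing unseen subwords from external aligned vectors, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say which factorization or which combination rule is used, so the truncated SVD, the softmax-weighted average over the `top_k` most similar source subwords, and all function and parameter names (`factorize_embeddings`, `init_new_coords`, `temperature`) are assumptions made for illustration only.

```python
import numpy as np


def factorize_embeddings(E, rank):
    """Approximate E (V x D) as coords (V x rank) @ basis (rank x D).

    Assumption: truncated SVD is used here as one possible factorization.
    """
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    coords = U[:, :rank] * S[:rank]  # per-subword low-dimensional coordinates
    basis = Vt[:rank]                # shared projection back to dimension D
    return coords, basis


def init_new_coords(new_vecs, src_vecs, src_coords, top_k=5, temperature=0.1):
    """Initialize low-dimensional coordinates for unseen subwords.

    new_vecs and src_vecs are rows from the same external static vector
    space. Each new subword gets a softmax-weighted average of the
    coordinates of its top_k most cosine-similar source subwords
    (a hypothetical combination rule, not taken from the source).
    """
    a = new_vecs / np.linalg.norm(new_vecs, axis=1, keepdims=True)
    b = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    sims = a @ b.T                                    # (N_new, V_src)
    top = np.argsort(-sims, axis=1)[:, :top_k]        # nearest source indices
    top_sims = np.take_along_axis(sims, top, axis=1)  # (N_new, top_k)
    w = np.exp((top_sims - top_sims.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkr->nr", w, src_coords[top])


# Toy usage with random data standing in for real embeddings and vectors.
rng = np.random.default_rng(0)
V, D, rank, n_new, d_static = 1000, 64, 16, 200, 32
E = rng.normal(size=(V, D))
coords, basis = factorize_embeddings(E, rank)
new_coords = init_new_coords(
    rng.normal(size=(n_new, d_static)),  # static vectors of unseen subwords
    rng.normal(size=(V, d_static)),      # static vectors of source subwords
    coords,
)
all_coords = np.vstack([coords, new_coords])  # extended vocabulary
params_full = (V + n_new) * D
params_factored = (V + n_new) * rank + rank * D
print(all_coords.shape, params_full, params_factored)
```

In this arrangement only the coordinate matrix grows with the vocabulary while the basis is shared by all subwords, so the parameter saving widens as more subwords are added.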

Comments:	NAACL 2024 Findings
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2311.08849 [cs.CL]
	(or arXiv:2311.08849v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.08849

Submission history

From: Yihong Liu
[v1] Wed, 15 Nov 2023 10:40:45 UTC (288 KB)
[v2] Mon, 25 Mar 2024 15:49:53 UTC (304 KB)
