Page MenuHomePhabricator

Enable NLLB-200 for languages lacking other Machine Translation options
Closed, ResolvedPublic

Description

The NLLB-200 model supports many languages. However, the initial integration in Content Translation supports a smaller set of 23 languages. In order to improve machine translation support, we want to support languages lack currently machine translation support by the current services available but could be provided using NLLB-200.

Candidate languages identified so far
These languages have no current MT support enabled, but NLLB-200 could provide it :

  1. Kashmiri (ks/kas_Arab) ( view request and T326541 ) — ✅ Enabled as part of T337290.
  2. Santali (sat/sat_Olck) ( view request) — ✅ Enabled as part of T337290.
  3. Tumbuka (tum/tum_Latn) (view request) — ✅ Enabled as part of T337290.
  4. Fulah/Nigerian Fulfulde (ff/fuv_Latn) — ✅ Enabled as part of T337290.
  5. Kabyle (kab/kab_Latn) — ✅ Enabled as part of T337290.
  6. Balinese (ban/ban_Latn) — ✅ Enabled as part of T337290.
  7. Banjar (bjn/bjn_Latn) — ✅ Enabled as part of T337290.
  8. South Azerbaijani (azb/azb_Arab) — ✅ Enabled as part of T337290.
  9. Pangasinan (pag/pag_Latn) — ✅ Enabled as part of T337290.
  10. Tibetan (bo/bod_Tibt) — ✅ Enabled as part of T337290.
  11. Crimean Tatar (crh/crh_Latn) – Planned as part of T337669.
  12. Lombard (lmo/lmo_Latn) – ✅ Enabled as part of T337669.
  13. Silesian (szl/szl_Latn) – ✅ Enabled as part of T337669.
  14. Venetian (vec/vec_Latn) – ✅ Enabled as part of T337669.
  15. Ligurian (lij/lij_Latn) – ✅ Enabled as part of T337669.
  16. Waray (war/war_Latn) – ✅ Enabled as part of T337669.
  17. Limburgish (li/lim_Latn) – ✅ Enabled as part of T337669.
  18. Faroese (fo/fao_Latn) – ✅ Enabled as part of T337669.
  19. Shan (shn/shn_Mymr) – ✅ Enabled as part of T337669.
  20. Friulian (fur/fur_Latn) – ✅ Enabled as part of T337669.
  21. Sicilian (scn/scn_Latn) – ✅ Enabled as part of T337834.
  22. Acehnese (ace/ace_Latn) – ✅ Enabled as part of T337834.
  23. Buginese (bug/bug_Latn) – ✅ Enabled as part of T337834.
  24. Tok Pisin (tpi/tpi_Latn) – ✅ Enabled as part of T337834.
  25. Fijian (fj/fij_Latn) – ✅ Enabled as part of T337834.
  26. Southwestern Dinka (din/dik_Latn) – ✅ Enabled as part of T337834.
  27. Rundi (rn/run_Latn) – ✅ Enabled as part of T337834.
  28. Kabiyè (kbp/kbp_Latn) – ✅ Enabled as part of T337834.
  29. Latgalian (ltg/ltg_Latn) – ✅ Enabled as part of T337834.
  30. Dzongkha (dz/dzo_Tibt) – ✅ Enabled as part of T337834.
  31. Egyptian Arabic (arz/arz_Arab) – ✅ Enabled as part of T338123.
  32. Moroccan Arabic (ary/ary_Arab) – ✅ Enabled as part of T338123.
  33. Kikuyu (ki/kik_Latn) – ✅ Enabled as part of T338123.
  34. Sango (sg/sag_Latn) – ✅ Enabled as part of T338123.
  35. Awadhi (awa/awa_Deva) – ✅ Enabled as part of T338123.
  36. Minangkabau (min/min_Latn) – ✅ Enabled as part of T340953
  37. Sardinian (sc/srd_Latn) – ✅ Enabled as part of T340953
  • Fon (fon/fon_Latn) (still in incubator, considered for T336683)
  • Akan (ak/aka_Latn) (Akan may be supported using Twi)
  • Kanuri/Central Kanuri (kr/knc_Arab/knc_Latn) (Wikipedia was closed, back to incubator, considered for T336683)

Languages already supported by NLLB-200 as their only option

From the list of languages currently supported by NLLB-200, these are those not supported by other services:

  1. Asturian (ast/ast_Latn)
  2. Kongo/Kikongo (kg/kon_Latn)
  3. Northern Sotho (nso/nso_Latn)
  4. Occitan (oc/oci_Latn)
  5. Swati (ss/ssw_Latn)
  6. Tswana (tn/tsn_Latn)
  7. Wolof (wo/wol_Latn)
  8. Cantonese/Yue Chinese (zh-yue/yue_Hant) (disabled as per T333835)

( Central Bikol is another language with MinT as the only option, but supported by Opus MT instead: T262253)

Languages listed on the NLLB-200 documentation
These languages are supported by the NLLB-200 model.
Marking in bold those wthout MT, and striked those with MT existing MT support.

  • Acehnese (Arabic script) (ace_Arab)
  • Acehnese (Latin script) (ace_Latn)
  • Mesopotamian Arabic (acm_Arab) no wiki yet
  • Ta’izzi-Adeni Arabic (acq_Arab) no wiki yet
  • Tunisian Arabic (aeb_Arab) no wiki yet
  • Afrikaans (afr_Latn)
  • South Levantine Arabic (ajp_Arab) no wiki yet
  • Akan (aka_Latn)
  • Amharic (amh_Ethi)
  • North Levantine Arabic (apc_Arab) no wiki yet
  • Modern Standard Arabic (arb_Arab)
  • Modern Standard Arabic (Romanized) (arb_Latn)
  • Najdi Arabic (ars_Arab) no wiki yet
  • Moroccan Arabic (ary_Arab)
  • Egyptian Arabic (arz_Arab)
  • Assamese (asm_Beng)
  • Asturian (ast_Latn)
  • Awadhi (awa_Deva)
  • Central Aymara (ayr_Latn)
  • South Azerbaijani (azb_Arab)
  • North Azerbaijani (azj_Latn)
  • Bashkir (bak_Cyrl)
  • Bambara (bam_Latn)
  • Balinese (ban_Latn)
  • Belarusian (bel_Cyrl)
  • Bemba (bem_Latn) no wiki yet
  • Bengali (ben_Beng)
  • Bhojpuri (bho_Deva)
  • Banjar (Arabic script) (bjn_Arab)
  • Banjar (Latin script) (bjn_Latn)
  • Standard Tibetan (bod_Tibt)
  • Bosnian (bos_Latn)
  • Buginese (bug_Latn)
  • Bulgarian (bul_Cyrl)
  • Catalan (cat_Latn)
  • Cebuano (ceb_Latn)
  • Czech (ces_Latn)
  • Chokwe (cjk_Latn) no wiki yet
  • Central Kurdish (ckb_Arab)
  • Crimean Tatar (crh_Latn)
  • Welsh (cym_Latn)
  • Danish (dan_Latn)
  • German (deu_Latn)
  • Southwestern Dinka (dik_Latn)
  • Dyula (dyu_Latn) no wiki yet
  • Dzongkha (dzo_Tibt)
  • Greek (ell_Grek)
  • English (eng_Latn)
  • Esperanto (epo_Latn)
  • Estonian (est_Latn)
  • Basque (eus_Latn)
  • Ewe (ewe_Latn)
  • Faroese (fao_Latn)
  • Fijian (fij_Latn)
  • Finnish (fin_Latn)
  • Fon (fon_Latn) wiki still in incubator
  • French (fra_Latn)
  • Friulian (fur_Latn)
  • Nigerian Fulfulde (fuv_Latn)
  • Scottish Gaelic (gla_Latn)
  • Irish (gle_Latn)
  • Galician (glg_Latn)
  • Guarani (grn_Latn)
  • Gujarati (guj_Gujr)
  • Haitian Creole (hat_Latn)
  • Hausa (hau_Latn)
  • Hebrew (heb_Hebr)
  • Hindi (hin_Deva)
  • Chhattisgarhi (hne_Deva) no wiki yet
  • Croatian (hrv_Latn)
  • Hungarian (hun_Latn)
  • Armenian (hye_Armn)
  • Igbo (ibo_Latn)
  • Ilocano (ilo_Latn)
  • Indonesian (ind_Latn)
  • Icelandic (isl_Latn)
  • Italian (ita_Latn)
  • Javanese (jav_Latn)
  • Japanese (jpn_Jpan)
  • Kabyle (kab_Latn)
  • Jingpho (kac_Latn) no wiki yet
  • Kamba (kam_Latn) no wiki yet
  • Kannada (kan_Knda)
  • Kashmiri (Arabic script) (kas_Arab)
  • Kashmiri (Devanagari script) (kas_Deva)
  • Georgian (kat_Geor)
  • Central Kanuri (Arabic script) (knc_Arab)
  • Central Kanuri (Latin script) (knc_Latn)
  • Kazakh (kaz_Cyrl)
  • Kabiyè (kbp_Latn)
  • Kabuverdianu (kea_Latn) no wiki yet
  • Khmer (khm_Khmr)
  • Kikuyu (kik_Latn)
  • Kinyarwanda (kin_Latn)
  • Kyrgyz (kir_Cyrl)
  • Kimbundu (kmb_Latn) no wiki yet
  • Northern Kurdish (kmr_Latn)
  • Kikongo (kon_Latn)
  • Korean (kor_Hang)
  • Lao (lao_Laoo)
  • Ligurian (lij_Latn)
  • Limburgish (lim_Latn)
  • Lingala (lin_Latn)
  • Lithuanian (lit_Latn)
  • Lombard (lmo_Latn)
  • Latgalian (ltg_Latn)
  • Luxembourgish (ltz_Latn)
  • Luba-Kasai (lua_Latn) no wiki yet
  • Ganda (lug_Latn)
  • Luo (luo_Latn) no wiki yet
  • Mizo (lus_Latn) no wiki yet
  • Standard Latvian (lvs_Latn)
  • Magahi (mag_Deva) no wiki yet
  • Maithili (mai_Deva)
  • Malayalam (mal_Mlym)
  • Marathi (mar_Deva)
  • Minangkabau (Arabic script) (min_Arab)
  • Minangkabau (Latin script) (min_Latn)
  • Macedonian (mkd_Cyrl)
  • Plateau Malagasy (plt_Latn)
  • Maltese (mlt_Latn)
  • Meitei (Bengali script) (mni_Beng)
  • Halh Mongolian (khk_Cyrl)
  • Mossi (mos_Latn) no wiki yet
  • Maori (mri_Latn)
  • Burmese (mya_Mymr)
  • Dutch (nld_Latn)
  • Norwegian Nynorsk (nno_Latn)
  • Norwegian Bokmål (nob_Latn)
  • Nepali (npi_Deva)
  • Northern Sotho (nso_Latn)
  • Nuer (nus_Latn) no wiki yet
  • Nyanja (nya_Latn)
  • Occitan (oci_Latn)
  • West Central Oromo (gaz_Latn)
  • Odia (ory_Orya)
  • Pangasinan (pag_Latn)
  • Eastern Panjabi (pan_Guru)
  • Papiamento (pap_Latn)
  • Western Persian (pes_Arab)
  • Polish (pol_Latn)
  • Portuguese (por_Latn)
  • Dari (prs_Arab) no wiki yet
  • Southern Pashto (pbt_Arab)
  • Ayacucho Quechua (quy_Latn)
  • Romanian (ron_Latn)
  • Rundi (run_Latn)
  • Russian (rus_Cyrl)
  • Sango (sag_Latn)
  • Sanskrit (san_Deva)
  • Santali (sat_Olck)
  • Sicilian (scn_Latn)
  • Shan (shn_Mymr)
  • Sinhala (sin_Sinh)
  • Slovak (slk_Latn)
  • Slovenian (slv_Latn)
  • Samoan (smo_Latn)
  • Shona (sna_Latn)
  • Sindhi (snd_Arab)
  • Somali (som_Latn)
  • Southern Sotho (sot_Latn)
  • Spanish (spa_Latn)
  • Tosk Albanian (als_Latn)
  • Sardinian (srd_Latn)
  • Serbian (srp_Cyrl)
  • Swati (ssw_Latn)
  • Sundanese (sun_Latn)
  • Swedish (swe_Latn)
  • Swahili (swh_Latn)
  • Silesian (szl_Latn)
  • Tamil (tam_Taml)
  • Tatar (tat_Cyrl)
  • Telugu (tel_Telu)
  • Tajik (tgk_Cyrl)
  • Tagalog (tgl_Latn)
  • Thai (tha_Thai)
  • Tigrinya (tir_Ethi)
  • Tamasheq (Latin script) (taq_Latn) no wiki yet
  • Tamasheq (Tifinagh script) (taq_Tfng) no wiki yet
  • Tok Pisin (tpi_Latn)
  • Tswana (tsn_Latn)
  • Tsonga (tso_Latn)
  • Turkmen (tuk_Latn)
  • Tumbuka (tum_Latn)
  • Turkish (tur_Latn)
  • Twi (twi_Latn)
  • Central Atlas Tamazight (tzm_Tfng) no wiki yet
  • Uyghur (uig_Arab)
  • Ukrainian (ukr_Cyrl)
  • Umbundu (umb_Latn) no wiki yet
  • Urdu (urd_Arab)
  • Northern Uzbek (uzn_Latn)
  • Venetian (vec_Latn)
  • Vietnamese (vie_Latn)
  • Waray (war_Latn)
  • Wolof (wol_Latn)
  • Xhosa (xho_Latn)
  • Eastern Yiddish (ydd_Hebr)
  • Yoruba (yor_Latn)
  • Yue Chinese (yue_Hant)
  • Chinese (Simplified) (zho_Hans)
  • Chinese (Traditional) (zho_Hant)
  • Standard Malay (zsm_Latn)
  • Zulu (zul_Latn)

Full list of languages supported by NLLB-200 in 3-letter ISO codes:

1ace
2acm
3acq
4aeb
5afr
6ajp
7aka
8amh
9apc
10arb
11ars
12ary
13arz
14asm
15ast
16awa
17ayr
18azb
19azj
20bak
21bam
22ban
23bel
24bem
25ben
26bho
27bjn
28bod
29bos
30bug
31bul
32cat
33ceb
34ces
35cjk
36ckb
37crh
38cym
39dan
40deu
41dik
42dyu
43dzo
44ell
45eng
46epo
47est
48eus
49ewe
50fao
51fij
52fin
53fon
54fra
55fur
56fuv
57gla
58gle
59glg
60grn
61guj
62hat
63hau
64heb
65hin
66hne
67hrv
68hun
69hye
70ibo
71ilo
72ind
73isl
74ita
75jav
76jpn
77kab
78kac
79kam
80kan
81kas
82kat
83knc
84kaz
85kbp
86kea
87khm
88kik
89kin
90kir
91kmb
92kmr
93kon
94kor
95lao
96lij
97lim
98lin
99lit
100lmo
101ltg
102ltz
103lua
104lug
105luo
106lus
107lvs
108mag
109mai
110mal
111mar
112min
113mkd
114plt
115mlt
116mni
117khk
118mos
119mri
120mya
121nld
122nno
123nob
124npi
125nso
126nus
127nya
128oci
129gaz
130ory
131pag
132pan
133pap
134pes
135pol
136por
137prs
138pbt
139quy
140ron
141run
142rus
143sag
144san
145sat
146scn
147shn
148sin
149slk
150slv
151smo
152sna
153snd
154som
155sot
156spa
157als
158srd
159srp
160ssw
161sun
162swe
163swh
164szl
165tam
166tat
167tel
168tgk
169tgl
170tha
171tir
172taq
173tpi
174tsn
175tso
176tuk
177tum
178tur
179twi
180tzm
181uig
182ukr
183umb
184urd
185uzn
186vec
187vie
188war
189wol
190xho
191ydd
192yor
193yue
194zho
195zsm
196zul


Related: T336683: Enable MinT support for languages with no Wikipedia yet

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

For the candidate languages I looked into some stats for last year (2022):

  • They represent a total of 157.5M views, and 2.6K translations were created with Content Translation.
  • Egyptian Arabic (arz) is the language with the most activity with a significant difference (63M views, 1.8K translations). It is also the only one with some MT support available (standard Arabic MT is provided in the lack of specific Egyptian Arabic).
  • Excluding Egyptian Arabic, the rest of the languages represent 93.7M views and 861 translations.

Views per language in 2022

turnilo.wikimedia.org_.png (1×1 px, 181 KB)

Translations per language in 2022

turnilo.wikimedia.org_ (1).png (1×1 px, 120 KB)

Pginer-WMF triaged this task as Medium priority.Jan 11 2023, 3:31 PM
Pginer-WMF updated the task description. (Show Details)
Pginer-WMF updated the task description. (Show Details)

Some languages names listed above are just (a/the) standard dialect of some language with a wiki:

  • North Azerbaijani = Azerbaijani (az)
  • Halh Mongolian = Mongolian (mn)
  • Plateau Malagasy = Malagasy (mg)
  • Western Persian = Persian (fa)
  • Tosk Albanian = Albanian (sq)
  • Northern Uzbek = Uzbek (uz)
  • Standard Malay = Malay (ms)

Some are (or may be) included in a more general wiki:

  • Eastern Yiddish => Yiddish (yi)

Some languages names listed above are just (a/the) standard dialect of some language with a wiki...

Thanks for the clarifications! I captured those in the description.
Since the main focus of the ticket is on languages for which a wiki exist but MT was not available, there are no changes in the list of selected languages. However, having a more accurate list of the mapping of existing wikis will be very useful in the future too. Thanks!

@Tumbuka_Arch You had emailed the language team members about English->Tumbuka Machine translation. As you can see Timbuka is a candidate for NLLB 200 MT service which we are trying to expand to more languages using this approach: T331505: Self hosted machine translation service

Note Akan Wikipedia is closed and migrated to Twi (tw) Wikipedia.