Scaling Sign Language Translation

Biao Zhang¹ Garrett Tanzer² Orhan Firat¹
¹ Google DeepMind ² Google
{biaojiaxing,gtanzer,orhanf}@google.com

Abstract

Sign language translation (SLT) addresses the problem of translating information from a sign language in video to a spoken language in text. Existing studies, while showing progress, are often limited to narrow domains and/or few sign languages and struggle with open-domain tasks. In this paper, we push forward the frontier of SLT by scaling pretraining data, model size, and number of translation directions. We perform large-scale SLT pretraining on different data including 1) noisy multilingual YouTube SLT data, 2) parallel text corpora, and 3) SLT data augmented by translating video captions to other languages with off-the-shelf machine translation models. We unify different pretraining tasks with task-specific prompts under the encoder-decoder architecture, and initialize the SLT model with pretrained (m/By)T5 models across model sizes. SLT pretraining results on How2Sign and FLEURS-ASL#0 (ASL to 42 spoken languages) demonstrate the significance of data/model scaling and cross-lingual cross-modal transfer, as well as the feasibility of zero-shot SLT. We finetune the pretrained SLT models on 5 downstream open-domain SLT benchmarks covering 5 sign languages. Experiments show substantial quality improvements over the vanilla baselines, surpassing the previous state-of-the-art (SOTA) by wide margins.

Figure 1: BLEU scores on different benchmarks: our model sets new SOTA results across benchmarks and sign languages. BLEURT is not shown because not all previous studies report it.

1 Introduction

Scalable neural networks trained on large amounts of unlabeled and/or weakly-labeled data from multiple modalities and multiple tasks have resulted in performance significantly exceeding that of single-task models trained on particular domains [30, 11, 33, 18]. Sign language translation (SLT), as a video-to-text translation task (while spoken language can be conveyed through either text or speech, this study focuses on text), features significant cross-modality challenges in video understanding and text generation. While extra forms of supervision such as glosses have been helpful in bridging the modality gap [4], they are nonstandardized/incomplete systems available only for small datasets [12]. Researchers have instead turned to more scalable approaches such as adapting pretrained vision and text models [7, 37, 49] and jointly modeling with machine translation [MT, 57]. Despite encouraging progress, these studies were performed at small scale, with success on narrow domains and few sign languages. In open-domain SLT settings, unfortunately, they have shown limited effectiveness [28].

In this paper, we aim to improve open-domain SLT for multiple sign languages by means of large-scale SLT pretraining with more data, larger models and more languages. Inspired by the finding that jointly training SLT models with MT data enables positive knowledge transfer to SLT [57], we explore the following pretraining tasks and data: web-crawled multilingual SLT, multilingual MT, and augmented SLT. Although high-quality SLT training data are scarce, weakly-labeled SLT data covering diverse topics and signers are readily available from platforms like YouTube. Prior studies have demonstrated the feasibility of collecting massive YouTube SLT data and its effectiveness in improving SLT [45, 44, 43], and we follow this effort in a multilingual setup. Unlike SLT data, text-based MT datasets are massive and resource-rich across hundreds of spoken languages [3, 10]. We explore a subset of MADLAD-400 [23] including up to 41 spoken languages for the pretraining. In addition, we construct synthetic multiway SLT data by translating video captions with an off-the-shelf MT model, which allows us to strengthen direct SLT across more translation directions. We investigate different ways of mixing these data to exploit weakly supervised SLT knowledge as well as cross-task cross-modality transfer at scale.

As shown in Figure 2, we extend the unified encoder-decoder SLT framework from [45, 42, 44, 43] with extra tasks and modalities similar to [57], across different pretrained model families (T5, mT5 and ByT5) and different model sizes. We distinguish different tasks by carefully designed input prompts that contain task-specific control tokens. This affords us high flexibility in choosing what tasks and languages to incorporate into the pretraining, easing both ablations and scaling. We then finetune the pretrained SLT models on downstream SLT benchmarks to refine the learned SLT knowledge.

We evaluate the effect of scaling on 6 open-domain SLT benchmarks across 5 sign languages. FLEURS-ASL#0 [42], built on FLORES-200 [10], gives us a testbed to analyze multiway American Sign Language (ASL)-to-X SLT (we examine English and 41 other target languages), while the other benchmarks are for a single language pair. While pretraining results show the acquired general SLT capability, we also report finetuning results following [45]. Our main findings are below:

• Adding more pretraining data, either machine translation or sign language translation data, is a promising way to improve SLT, yielding quality gains of varying degrees.
• Zero-shot ASL-to-X translation for language pairs not seen during pretraining is achievable by jointly training on ASL-to-En SLT data and En-to-X MT data.
• Augmenting SLT data by translating target captions to other languages with off-the-shelf MT models substantially improves the translation.
• Using larger models is not always helpful: ByT5 Base (582M) often outperforms XL (3.74B), but model scaling does benefit SLT when modeling capacity becomes a bottleneck (e.g., when more languages and data are used).
• Learned metrics (e.g., BLEURT) show higher correlation between pretrained and finetuned SLT scores than classical metrics (e.g., BLEU or ChrF).

Putting everything together, our model achieves new state-of-the-art results across the benchmarks as shown in Figure 1, demonstrating the significance of scaling SLT.

2 Sign Language Translation

2.1 Modeling

We build on a line of work using T5 model families for SLT [45, 42, 44, 43], which build upon earlier SLT work [4, 60, 57] using the encoder-decoder architecture [41, 46]. Figure 2(a) shows the overall structure. The encoder takes as input the concatenation of a prompt instructing the task and a sequence of sign language video frames; the decoder predicts the text output in a target spoken language one token at a time. We adopt the family of pretrained (m/By)T5 models [35, 52] as the backbone and adapt them to SLT via large-scale SLT pretraining followed by downstream finetuning, i.e. (m/By)T5 initialization $\rightarrow$ SLT pretraining $\rightarrow$ SLT finetuning.

We rely on web-crawled YouTube SLT data for SLT pretraining, which provide high coverage on domains and signers albeit at lower quality. Although recent debates value data quality over data quantity in pretraining [21, 24, 29], we argue that they were established on the availability of massive high-quality training data, which doesn’t hold for SLT yet. We expect that the pretraining could capture the (weakly) supervised SLT knowledge from the crawled data as in previous studies [45].

As shown in Figure 2(b), we adopt the clip-level training following [42] that randomly samples a clip of $N$ seconds from the sign video and then predicts various types of in-clip information (such as caption texts and their start and end timestamps) based on the frames of the entire clip. Detailed tasks are listed in Figure 2(a), which are all formulated as sequence-to-sequence tasks. They are distinguished by prompts with different control tokens and are trained with the standard maximum likelihood objective. For the baseline, we consider the following two tasks: SLT and alignment, and train it by mixing these two tasks with a pre-specified mix ratio.

SLT: This is the core task that directly models the translation from clip frames to the clip text in a target language. It is indispensable for the model to acquire the translation capability.
Alignment: It is an auxiliary task for SLT, learning to align the input clip with its captions. We train the model to infer the start and end time stamp for each in-clip caption. Apart from regularization, this task could improve the model’s understanding of sign language [42].
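The clip-level setup above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: captions are assumed to be `(start, end, text)` tuples in seconds, and the control-token strings, the output format of the alignment task, and all function names are hypothetical.

```python
import random

def sample_clip(video_duration, clip_len, rng):
    """Pick a random clip window of clip_len seconds inside the video."""
    start = rng.uniform(0.0, max(0.0, video_duration - clip_len))
    return start, min(video_duration, start + clip_len)

def in_clip_captions(captions, clip_start, clip_end):
    """Keep captions lying fully inside the clip, with clip-relative times."""
    return [(s - clip_start, e - clip_start, text)
            for s, e, text in captions
            if s >= clip_start and e <= clip_end]

def build_example(task, captions, target_lang):
    """Build a (prompt, target) text pair; the token names are assumptions."""
    prompt = f"<{task}> <{target_lang}>"
    if task == "slt":
        # SLT: predict the text of all in-clip captions.
        target = " ".join(text for _, _, text in captions)
    elif task == "align":
        # Alignment: predict start/end timestamps for each in-clip caption.
        target = " ".join(f"{s:.1f} {e:.1f} {text}" for s, e, text in captions)
    else:
        raise ValueError(f"unknown task: {task}")
    return prompt, target
```

Both tasks reduce to plain sequence-to-sequence pairs, so they can share one maximum likelihood objective and differ only in the prompt.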

2.2 Scaling Model Size, Number of Languages and Pretraining Data Size

Model Scaling Scaling model size increases modeling capacity, which has been widely proven effective in improving the task performance [30, 18, 19]. We study whether and how increasing model size affects the SLT performance and compare (By/m)T5 models for SLT at different scales.

Language Scaling While most SLT works focus on a few sign and spoken languages, we expand our study to many languages, covering up to 80 sign/spoken languages during pretraining, and 5 sign languages and 42 spoken languages at evaluation. We are interested in whether a single SLT model could support multiple sign/spoken languages with non-trivial performance, and whether knowledge transfer could improve SLT on low-resource languages [56, 44].

Data Scaling Data scarcity is the main bottleneck hindering the development of SLT. To address this issue, we investigate the following three types of data for the pretraining:

SLT: We crawl multilingual YouTube SLT data following the recipe of [44], except that we didn’t perform human annotation and filtering. This allows us to significantly scale up the SLT data by 3 $\sim$ 6 times, reaching $\sim$ 6,600 hours in total, albeit at much lower quality.
Machine Translation: Unlike SLT, MT is a text-to-text translation task with rich parallel resources, particularly for high-resource languages [2]. We explore adding multilingual MT data into the pretraining and mark this task with control token “<mt>” [57].
Augmented SLT: SLT data are often one-to-one translation data, where each sign language only has translation in one spoken language. This makes the translation of a sign language to other spoken languages difficult. We thus augment SLT data to one-to-many by translating the target text to other spoken languages via off-the-shelf MT models. As in Figure 2(a), we use “<aug>” to separate genuine SLT data from the augmented one [6].
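A minimal sketch of the one-to-many augmentation described above. The example schema, the `translate` callable, and the prompt layout are assumptions made for illustration; in practice `translate` would wrap an off-the-shelf MT model.

```python
def augment_example(example, target_langs, translate):
    """Expand one (video, caption) pair into one pair per target language."""
    out = [dict(example, augmented=False)]  # keep the genuine pair
    for lang in target_langs:
        if lang == example["lang"]:
            continue  # the original caption language is already covered
        out.append({
            "video_id": example["video_id"],
            "lang": lang,
            "text": translate(example["text"], example["lang"], lang),
            "augmented": True,
        })
    return out

def prompt_for(example):
    """Mark machine-translated targets with an "<aug>" control token."""
    tag = "<aug> " if example["augmented"] else ""
    return f"{tag}<slt> <{example['lang']}>"
```

Tagging augmented pairs lets the model tell genuine captions from machine-translated ones, while all pairs still share the same video input.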

3 Setup

MT Pretraining Data We use the parallel sentence-level portion of MADLAD-400 [23] as the MT pretraining data. We extract a subset of MADLAD-400 for experiments, including 41 languages (apart from English (En)) covering diverse language families and scripts, and explore the impact of En $\rightarrow$ Xx and Xx $\rightarrow$ En MT data on SLT in experiments. We create two settings for the pretraining:

• MT-Small: A high/medium-resource subset including 11 languages: es, de, fr, it, cs, pl, ru, zh, ar, ja, hi.
• MT-Large: This set includes all 41 languages. Apart from MT-Small, it has nl, pt, sv, hu, da, fi, el, sk, no, bg, lt, lv, sl, et, ko, hr, is, sr, tr, vi, id, he, th, ms, uk, ca, ta, fil, ne, cy.

Table LABEL:tab:lang_stats shows the statistics for each language. Unless otherwise specified, we balance the MT data distribution over languages during training by temperature sampling with a rate of 5 [2].
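As a rough illustration of temperature-based balancing, the sketch below assumes the common formulation in which language i is sampled with probability proportional to n_i^(1/T), where n_i is its number of sentence pairs and T = 5. The function name and the example sizes are illustrative, not taken from the training code.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Return per-language sampling probabilities proportional to n ** (1/T)."""
    weights = {lang: n ** (1.0 / temperature) for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Illustrative sizes only: one large and one small language pair.
sizes = {"fr": 243_000_000, "ja": 5_000_000}
proportional = temperature_sampling_probs(sizes, temperature=1.0)
balanced = temperature_sampling_probs(sizes, temperature=5.0)
```

With T = 1 sampling follows the raw data proportions; larger T flattens the distribution, so smaller languages are seen more often than their size alone would imply.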

SLT Pretraining Data We experiment with noisy captioned sign language videos from YouTube. This is the full set of videos pre-manual filtering in [44]. Estimated statistics for each sign language are summarized in Table LABEL:tab:lang_stats. We also have two settings for this data:

• YT-ASL: $\sim$ 2,800 hours of noisy captioned ASL videos; a superset of YouTube-ASL [45] (modulo video churn) and the same dataset used by [43].
• YT-Full: $\sim$ 6,600 hours of noisy captioned multilingual sign language videos; a superset of [44].

During training, we mix the SLT data for all languages in proportion to their duration. We further augment these data with other spoken languages via MADLAD-MT-3B [23]. For ASL SLT data, we translate the English captions to 41 spoken languages listed in MT-Large, which makes YT-ASL 43-way multilingual SLT, namely Aug-YT-ASL; for other SLT data, we translate the target text into English, resulting in 3-way multilingual SLT. (Note that translations were performed per caption, so they may lack coherence when compiled into a document.) We refer to the augmented SLT data for all sign languages as Aug-YT-Full. Similar to MT-Small and MT-Large, we reorganize the augmented data into Aug-YT-ASL-Small/Aug-YT-Full-Small and Aug-YT-ASL-Large/Aug-YT-Full-Large.

SLT Pretraining Mixture We ablate across several SLT pretraining mixtures.

• Baseline: Caption alignment and SLT tasks. We use the task weights from [42], including $4\%$ for alignment.
• Baseline + MT: We mix MT data into Baseline with a sampling probability of $p_{mt}$.
• Baseline + Augmented SLT: We replace the Baseline SLT data with the augmented SLT data and uniformly sample the target language for each example at each step.
• Baseline + MT + Augmented SLT: Baseline + MT but with augmented target languages, as above.
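The mixtures above can be sketched as a two-level sampler. This is an illustration under assumptions: the 4% alignment weight is applied within the non-MT portion and the remainder goes to SLT, and the uniform target-language choice includes the original caption language. Neither detail is confirmed by the text, and the function names are hypothetical.

```python
import random

def sample_task(rng, p_mt=0.0, p_align=0.04):
    """Draw the task for one training example."""
    if rng.random() < p_mt:
        return "mt"  # text-to-text machine translation
    # Otherwise fall back to the Baseline mixture of alignment and SLT.
    return "align" if rng.random() < p_align else "slt"

def sample_target_lang(rng, original_lang, augmented_langs):
    """Uniformly pick a target language for an augmented SLT example."""
    return rng.choice([original_lang] + list(augmented_langs))

rng = random.Random(0)
draws = [sample_task(rng, p_mt=0.9) for _ in range(20000)]
mt_fraction = draws.count("mt") / len(draws)
```

Setting `p_mt=0.0` recovers the Baseline mixture, and skipping `sample_target_lang` recovers the non-augmented variants.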

Task	Sign Lang	Target Lang	#Train	#Dev	#Test
How2Sign	ASL		183,097	10,277	13,890
Elementary23	GSS		35,970	512	
WMT23	LIS-CH		1,901	100	250
	LSF-CH		5,560	100	250
	DSGS		310,840	420 (WMT22)	250/246 (SS/SRF split)
FLEURS-ASL#0	ASL	200 Flores Langs			353

Table 1: Summary of downstream SLT benchmarks. “#Train/#Dev/#Test”: the number of examples in the train, dev and test split. Note the sign language video and the target text in these benchmarks are often pre-segmented and aligned at sentence level. “DGS/ASL/GSS”: German/American/Greek Sign Language; “En/De/Fr/It”: English/German/French/Italian; “LIS-CH”: Italian Sign Language of Switzerland; “LSF-CH”: French Sign Language of Switzerland; “DSGS”: Swiss German Sign Language.

Downstream Benchmarks, Evaluation and Model Setting We thoroughly evaluate the translation performance on a range of open-domain SLT benchmarks, including How2Sign [14], Elementary23 [47] (while not as restricted as specific domains like “weather forecasts”, the scope of topics in Elementary23 remains somewhat focused), WMT23 [28] and FLEURS-ASL#0 (signer id #0) [42]. Detailed information for each benchmark is given in Table 1. Overall, the evaluation covers 5 source sign languages and 42 target spoken languages. (We acknowledge that there are other SLT benchmarks available in academia. We didn’t include them in our experiments due to their licensing restrictions and/or domain limitations.)

We report translation results for Pretraining and Finetuning. During inference, we use beam search with a beam size of 5. We evaluate translation with detokenized BLEU [31] and ChrF [32], as well as a neural metric, BLEURT [34]. We use BLEURT as the main metric [17]. We initialize our SLT model with three T5 model families: T5 [35], mT5 [51] and ByT5 [52], at three different sizes: Base, Large and XL. We optimize models with Adafactor [39], and set the maximum text input, landmark input, and text output length to 512. More setup details are given in Appendix A.1.
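To give a feel for what a character-level metric such as ChrF measures, here is a simplified sketch of a character n-gram F-score. It is not the reference implementation used for the reported numbers: stripping whitespace, averaging over orders 1 to 6, and using β = 2 are assumptions based on common defaults, and real toolkits differ in details.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level character n-gram F-beta score in [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because it matches surface characters, such a score rewards spelling overlap, whereas a learned metric like BLEURT is trained to track human judgments.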

4 Experiments

4.1 SLT Pretraining Results

SLT Data	Model	How2Sign			FLEURS-ASL#0 (En)
SLT Data	Model	Base	Large	XL	Base	Large	XL
YT-ASL	T5	29.54	27.95	22.96	32.8	4.18	32.41
	mT5	34.94	8.46	23.7	35.59	43.53	23.09
	ByT5	30.36	23.51	29.2	44.84	28.47	41.65
YT-Full	T5	31.64	25.45	8.57	42.86	37.55	30.02
	mT5	31.46	19.37	24.46	38.03	24.56	33.16
	ByT5	37.13	22.61	29.59	52.48	43.01	52.71

Table 2: Pretraining performance (BLEURT $\uparrow$) for different sized (By/m)T5 models when pretrained on YT-ASL and YT-Full. Results are reported on the test set of How2Sign and FLEURS-ASL#0 ($\rightarrow$ En, i.e. English as the target). Best results for each model family are highlighted in bold.

Model scaling doesn’t improve SLT consistently: Base often outperforms Large/XL. Table 2 shows that scaling up model size rarely results in consistent quality improvements. Different from findings on text-only tasks [35, 52], Base surpasses Large and XL in most cases, where Large often converges the slowest and performs the worst. Model scaling alone doesn’t significantly reduce the video-text modality gap, although better optimization and checkpoint selection could help. XL performs relatively comparably to Base. When MT data is mixed in and modeling capacity becomes the bottleneck, the value of scaling to XL emerges, as shown in Figure 8 and Table 5.

Backbone affects SLT substantially; ByT5 generally performs the best. While several previous studies selected T5 [45, 28] or mT5 [44] as the SLT backbone, we observe in Table 2 that ByT5-based SLT outperforms its T5 counterpart in most settings, confirming the results of [43] at scale. Given that larger models do not consistently perform better, it seems less likely that ByT5’s superiority comes from its encoder-heavy parameter allocation, and more likely that it is due to its spelling capabilities and reduced input length gap between byte text sequences and video frame sequences. Unless otherwise stated, we use ByT5 Base for the following experiments.

Scaling SLT data generally improves quality significantly. Adding more SLT data, i.e. from YT-ASL to YT-Full, largely improves the translation quality in most settings. For ByT5-based SLT particularly, the gain reaches $\sim$ 7 BLEURT on How2Sign and $\sim$ 11 BLEURT on FLEURS-ASL#0 (En) for Base and XL, respectively. We conjecture that adding more (multilingual) SLT data helps reduce the modality gap (especially with skeletons, which lack pretrained representations) and enable cross-lingual knowledge transfer [2, 56, 54, 44].

Mixing MT and SLT data yields positive knowledge transfer to SLT. We next explore whether and how the addition of MT data benefits SLT, starting with YT-ASL and bilingual MT data with $p_{mt}=0.5$ . Figures 3(a) and 5 show that adding bilingual translation data improves SLT performance generally, confirming the findings of SLTUNet [57]—that jointly training with MT enables positive knowledge transfer—at scale. The quality gains vary greatly across languages, which show little correlation with language family or training data scale. For example, adding a large amount of Fr $\rightarrow$ En data ( $\sim$ 243M sentence pairs) helps little (or even hurts) on FLEURS-ASL#0 (En), while adding a small amount of Ja $\rightarrow$ En data ( $\sim$ 5M sentence pairs) gives a gain of at least $3$ BLEURT on How2Sign and FLEURS-ASL#0 (En).

Translation direction of MT data affects transfer to SLT. There are three ways to leverage MT data for SLT: 1) X $\rightarrow$ En, 2) En $\rightarrow$ X, and 3) both. We compare 1) and 3) in Figures 3(a) and 5 for ASL-to-En SLT. The translation direction of MT data influences SLT performance greatly, and its effect varies across languages. On average, X $\rightarrow$ En benefits ASL-to-En SLT more than X $\leftrightarrow$ En: +0.08 and +0.9 BLEURT on How2Sign and FLEURS-ASL#0 (En), respectively. We speculate that including translation into X uses model capacity, which, while enabling zero-shot ASL-to-X SLT as discussed below, results in slightly worse ASL-to-En performance. This suggests that MT data with the same target language as SLT is most effective for transfer. Table 3 shows further support where En $\rightarrow$ X surpasses X $\rightarrow$ En on multilingual SLT.

SLT Data	Dir	BLEURT
SLT Data	Dir	Small	Large
Baseline + YT-ASL		15.85	17.21
Baseline + YT-Full		24.36	23.16
Baseline + MT-Small
YT-ASL	En $\rightarrow$ X	23.51	21.25
	X $\rightarrow$ En	17.44	19.72
	En $\leftrightarrow$ X	23.84	19.19
YT-Full	En $\rightarrow$ X	27.29	23.15
	X $\rightarrow$ En	22.48	22.47
	En $\leftrightarrow$ X	26.33	21.48
Baseline + MT-Large
YT-ASL	En $\leftrightarrow$ X	24.69	26.60
YT-Full	En $\leftrightarrow$ X	29.52	30.69

Table 3: Pretraining performance for Baseline + MT with $p_{mt}=0.9$ when scaling up languages and data. We show averaged BLEURT $\uparrow$ results on FLEURS-ASL#0. Results are for ByT5 Base. “Dir”: translation direction of MT data; “Small/Large”: average results over the target languages included in MT-Small/MT-Large on FLEURS-ASL#0.

We can achieve zero-shot bilingual ASL-to-X SLT via ASL-to-En SLT + En $\leftrightarrow$ X MT, albeit at poor quality. If knowledge can be transferred from MT to SLT, one straightforward question is whether we can achieve zero-shot SLT by jointly training with MT. We do so by training on ASL-to-En SLT + En $\leftrightarrow$ X MT data and examining zero-shot ASL-to-X SLT on FLEURS-ASL#0 (X). Figure 3(b) shows that this is feasible. On Pl and It, we observe quality gains over 12 BLEURT; on average, adding MT data improves zero-shot SLT by $\sim$ 6 BLEURT. Nevertheless, the overall zero-shot SLT performance is poor, and the gains are unstable across languages, e.g. performance degrades for ASL-to-Hi SLT with joint MT training. Similar findings were also observed in multilingual MT and speech translation [56, 13].

Using a higher sampling ratio for the MT data, i.e. larger $p_{mt}$, often improves SLT. In the above experiments, we start with $p_{mt}=0.5$, i.e., sampling an equal amount of SLT and MT data, following intuition. However, the proportion of different types of data often has non-negligible influence in multilingual modeling [2, 9]. We next explore its impact on SLT and use MT En-De for illustration. Figures 4 and 6 show that $p_{mt}=0.5$ is sub-optimal and sampling more MT data improves SLT in most settings, regardless of using ByT5 Base or XL, YT-ASL or YT-Full, MT De $\rightarrow$ En or De $\leftrightarrow$ En, and How2Sign or FLEURS-ASL#0 (En/De). In addition, increasing the proportion of MT data also improves zero-shot ASL-to-De SLT. Note that another benefit of using more MT data is faster training, as loading SLT data is much slower than loading text-only MT data. We use $p_{mt}=0.9$ by default in the following experiments.

Multilingual MT improves multilingual (zero-shot) SLT. The above experiments mainly analyze SLT with bilingual MT. We next investigate how multilingual MT affects multilingual (zero-shot) SLT, particularly the use of MT-Small and MT-Large. We report results for ASL-to-Small and ASL-to-Large SLT on FLEURS-ASL#0 where Small and Large denote the target languages covered by MT-Small and MT-Large, respectively. Note all SLT directions are zero-shot except the translation to English.

Table 3 summarizes the average performance. Using multilingual X $\rightarrow$ En MT data results in unstable ASL-to-X SLT performance, which even hurts SLT on YT-Full. In contrast, multilingual En $\rightarrow$ X and En $\leftrightarrow$ X MT data are both very helpful to SLT, where the former often outperforms the latter. By default, we still use En $\leftrightarrow$ X MT data in the following experiments so as to fully leverage the knowledge in MT data during pretraining.

Figures 7(a) and 7(b) further show the language breakdown results. Adding multilingual MT significantly improves ASL-to-En SLT when using YT-ASL alone, while the gain almost disappears when using larger-scale SLT data, YT-Full. Again, we note that the overall zero-shot translation quality is poor – the best average BLEURT on Small and Large is 29.52 and 30.69, respectively. Achieving significant ASL-to-X SLT requires techniques beyond naive SLT and MT data mixing.

Setting	BLEURT
Setting	Small	Large
Baseline + YT-ASL	15.85	17.21
+ Aug-YT-ASL-Small	31.14	19.74
+ MT-Small	30.51	19.70
+ Aug-YT-ASL&MT-Large	25.83	33.71
+ ByT5 XL	38.53	42.56
+ MT-3B Cascading	34.82	37.82
Baseline + YT-Full	24.36	23.16
+ Aug-YT-Full-Small	38.53	29.49
+ MT-Small	36.84	25.67
+ Aug-YT-Full&MT-Large	36.01	39.85
+ ByT5 XL	45.12	48.05
+ MT-3B Cascading	43.54	46.32
+ ByT5 XL	44.82	47.52

Table 4: Pretraining performance (averaged BLEURT $\uparrow$) for Baseline + Augmented SLT + MT with $p_{mt}=0.9$ on the FLEURS-ASL#0 test set. MT data are multilingual in both directions. Baseline is for ByT5 Base; “MT-3B”: MADLAD-MT-3B, the model used for SLT augmentation; “Cascading”: translating FLEURS-ASL#0 to English and then performing MT to other target languages.

Data augmentation and large-capacity modeling are promising methods for multilingual SLT. In MT, a common solution to improve zero-shot quality is to construct pseudo translation data for zero-shot directions [2, 15, 55, 16]. We examine this practice for SLT. We adopt publicly available pretrained MT models to generate data for more target languages for the YouTube SLT data (i.e., Augmented SLT). Results in Table 4 demonstrate the effectiveness of Augmented SLT, which significantly improves the best performance for ByT5 Base-based SLT to 36.01 and 39.85 average BLEURT on Small and Large with a gain of 6.49 (29.52 $\rightarrow$ 36.01) and 9.16 (30.69 $\rightarrow$ 39.85), respectively. Note there are 42 languages in Large. ByT5 Base may have insufficient capacity to accommodate translation into so many languages. Increasing the modeling capacity to XL yields another gain of 9.11 (36.01 $\rightarrow$ 45.12) and 8.2 (39.85 $\rightarrow$ 48.05) average BLEURT on Small and Large, respectively. On YT-ASL, Augmented SLT and ByT5 XL also lead to substantial quality improvements of 13.84 (24.69 $\rightarrow$ 38.53)/15.96 (26.60 $\rightarrow$ 42.56) average BLEURT on Small/Large. The final performance even surpasses the cascading baseline, i.e. ASL-to-En SLT chained with En-to-X MT, under both YT-ASL and YT-Full. Figures 8(a) and 8(b) also show the quality improvements across languages resulting from data augmentation and ByT5 XL.
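The difference between the cascading baseline and direct multilingual SLT can be sketched as two call patterns. The callables `slt_to_english`, `text_mt`, and `slt_model` are hypothetical placeholders for an ASL-to-En SLT model, a text MT model, and a multilingual SLT model.

```python
def cascade_translate(video, target_lang, slt_to_english, text_mt):
    """Two-stage baseline: sign video -> English text -> target-language text."""
    english = slt_to_english(video)
    if target_lang == "en":
        return english
    return text_mt(english, "en", target_lang)

def direct_translate(video, target_lang, slt_model):
    """Single-stage alternative: one model maps the video to the target language."""
    return slt_model(video, target_lang)
```

In the cascade, any error made in the English hypothesis is passed on to the MT stage, whereas the direct model conditions on the video for every target language.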

ID	Model	H2S	E23	WMT23				Avg
ID	Model	H2S	E23	LIS-CH	LSF-CH	SRF	SS	Avg
0	Previous SOTA	50.80	-	25.20	18.80	24.60	37.70	-
1	ByT5 Base	34.00	22.14	22.77	7.74	15.41	26.88	21.49
2	1 + Baseline + YT-ASL	51.74	37.79	24.24	15.43	21.82	35.59	31.10
3	2 + MT-Small ( $p_{mt}=0.9$ )	52.62	45.98	33.10	24.58	23.33	45.45	37.51
4	3 + Aug-YT-ASL-Small	53.36	49.34	38.61	28.70	25.87	49.61	40.91
5	4 + Aug-YT-ASL&MT-Large + ByT5 XL	54.28	54.16	38.93	27.29	28.42	51.73	42.47
6	2 + YT-Full	53.51	49.48	42.11	31.16	21.15	44.28	40.28
7	6 + Aug-YT-ASL&MT-Small	53.70	53.13	45.09	37.69	30.31	52.45	45.40
8	7 + Aug-YT-ASL&MT-Large + ByT5 XL	55.69	56.94	51.94	41.14	33.94	57.96	49.60
9	8 + Multilingual SLT Tuning	53.47	55.57	54.54	39.26	29.33	58.08	48.38

Table 5: Finetuning performance (BLEURT $\uparrow$) on downstream SLT benchmarks. “H2S/E23”: How2Sign/Elementary23. “SRF/SS”: WMT23 DSGS SRF/SS test split. “Avg”: averaged performance over all benchmarks. MT data are added in both translation directions. Previous SOTA: How2Sign [43], Elementary23 [47] and WMT23 SRF [28], WMT23 LIS-CH, LSF-CH, SS [44]. All models are finetuned on each SLT benchmark separately except (9).

4.2 SLT Finetuning Results

Finetuning on downstream benchmarks substantially improves SLT performance. Table 5 shows that finetuning the pretrained SLT models yields substantial quality gains across benchmarks and settings. This is because the potential of pretrained models is not fully elicited by direct evaluation due to video recording, domain and (clip-based) pretraining vs. (segment-based) inference mismatches, and finetuning largely mitigates these gaps. For example, pretraining with external augmented SLT and MT data results in even worse pretraining performance ((6) $\rightarrow$ (7)) in Table 9. After finetuning, nevertheless, model (7) significantly surpasses model (6) by 5.12 BLEURT on average.

Adding multilingual SLT data (YT-Full) into the pretraining greatly improves the performance from 14.26 (model (5)) to 32.48 BLEURT (model (8)) in Table 9. However, the quality gain after finetuning for YT-ASL based models is often higher than their YT-Full counterparts, where the largest gain reaches $\sim$ 28 BLEURT for model (5). We argue that pretraining on YT-ASL mainly teaches understanding of ASL, so pretrained performance on other sign languages is poor, but finetuning can quickly adapt the learned representations to other sign languages.

Note we also finetuned the vanilla ByT5 model without SLT pretraining for reference, which achieves 7.68 and 3.10 BLEU on Elementary23 and WMT23 DSGS SS, respectively. Despite their inferiority, these results already surpass the previous SOTA, further showing the potential of ByT5.

A model’s pretraining performance may be misleading when estimating its downstream finetuning performance, depending on the evaluation metric. Intuitively, a model with better pretrained results should yield better finetuned results. The Spearman’s correlation results in Table 6 confirm this intuition, where the correlation scores are positive across metrics. However, BLEU and ChrF have correlation scores of only 0.347 and 0.186, respectively, which are rather weak. The correlation for ChrF is not even significant, which may be caused by the use of BLEU as the model selection metric. In contrast, the correlation of BLEURT reaches 0.578 and is significant at $p<0.01$.

	BLEU	ChrF	BLEURT
Spearman’s $\rho$	0.347^†	0.186	0.578^‡

Table 6: Spearman correlation between direct (i.e. pretraining) and finetuning SLT results under different metrics based on Tables 5 and 9. ${}^{\dagger}/^{\ddagger}$: significant at $p<0.05/0.01$.
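For reference, Spearman’s ρ is the Pearson correlation computed on ranks. The sketch below is a plain-Python illustration with tie handling by average ranks; the function names and the toy scores are made up for the example and are not values from Tables 5 or 9.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy scores (illustrative only): pretraining vs. finetuning results.
pretrain_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
finetune_scores = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman(pretrain_scores, finetune_scores)
```

Since only the ordering matters, ρ asks whether a model that ranks higher before finetuning also tends to rank higher afterwards, regardless of the absolute score scale of each metric.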

Model, data and language scaling together lead to new state-of-the-art results. Diving deeper into Table 5, we see clear improvements brought by scaling model size, data, and/or languages for SLT. Adding YT-ASL SLT data into the pretraining yields $\sim$ 10 average BLEURT improvement ((1) $\rightarrow$ (2)). Jointly training SLT with MT data produces another gain of $\sim$ 6 BLEURT ((2) $\rightarrow$ (3)). Data augmentation adds an improvement of $\sim$ 3 BLEURT ((3) $\rightarrow$ (4)), which matches the quality achieved by adding a large amount of extra multilingual SLT data to the baseline, i.e. (4) 40.91 vs. (6) 40.28. By further increasing the amount of MT and augmented SLT data as well as the ByT5 model size, we reach an average BLEU, ChrF and BLEURT of 16.90, 39.49, and 49.60, respectively (model (8)). These results also outperform previous best results, establishing the new SOTA.
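The average gains quoted in this paragraph can be re-derived from the per-benchmark BLEURT scores in Table 5. The snippet below copies those rows (H2S, E23, LIS-CH, LSF-CH, SRF, SS) and recomputes the means; the helper names are illustrative.

```python
# Per-benchmark BLEURT scores from Table 5, keyed by model ID.
TABLE5 = {
    1: [34.00, 22.14, 22.77, 7.74, 15.41, 26.88],
    2: [51.74, 37.79, 24.24, 15.43, 21.82, 35.59],
    3: [52.62, 45.98, 33.10, 24.58, 23.33, 45.45],
    4: [53.36, 49.34, 38.61, 28.70, 25.87, 49.61],
    5: [54.28, 54.16, 38.93, 27.29, 28.42, 51.73],
    6: [53.51, 49.48, 42.11, 31.16, 21.15, 44.28],
    7: [53.70, 53.13, 45.09, 37.69, 30.31, 52.45],
    8: [55.69, 56.94, 51.94, 41.14, 33.94, 57.96],
    9: [53.47, 55.57, 54.54, 39.26, 29.33, 58.08],
}

def mean(scores):
    """Unrounded average over the six benchmark columns."""
    return sum(scores) / len(scores)

def gain(model_a, model_b):
    """Average BLEURT change when moving from model_a to model_b."""
    return mean(TABLE5[model_b]) - mean(TABLE5[model_a])

for a, b in [(1, 2), (2, 3), (3, 4), (6, 7), (8, 9)]:
    print(f"({a}) -> ({b}): {gain(a, b):+.2f} average BLEURT")
```

Small differences from the figures in the text (on the order of 0.01) can arise because the table reports averages already rounded to two decimals.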

Multilingual finetuning improves multilingual SLT with encouraging performance, although it still underperforms bilingual finetuning on average. We next study multilingual finetuning on the direct mix of different SLT benchmarks. Table 5 ((8) $\rightarrow$ (9)) shows that multilingual SLT outperforms previous SOTA on almost all benchmarks, but underperforms its bilingual counterpart by 1.22 BLEURT on average. How to balance modeling capacity among different languages in a joint model and avoid cross-lingual/modality interference is a well-known issue in multilingual modeling [2, 56, 48], and multilingual SLT also suffers from it [54]; we leave this to future work. Still, multilingual SLT facilitates transfer to LIS-CH, leading to a substantial gain of 2.6 BLEURT ((8) $\rightarrow$ (9)).

5 Related Work

The main bottleneck of SLT is data scarcity. Early studies address this issue by developing more data-efficient neural architectures and/or training algorithms. Camgoz et al. [4] pioneered the study with encoder-decoder based recurrent models for SLT, which were quickly replaced by Transformers and multi-task learning with CTC regularization [5]. Zhou et al. [60] developed a spatial-temporal architecture to model the collaboration of different visual cues. Another way is to transfer knowledge from pretrained models, augmentations, and other tasks. Chen et al. [7, 8] proposed to leverage pretrained visual encoders and MT models to improve SLT, while Zhang et al. [57] explored transferring translation knowledge from MT data directly. Zhou et al. [59] employed back-translation to generate pseudo SLT training data. Ye et al. [53] augmented the training data by the mix-up algorithm. Yet another way to address data scarcity is to make data less scarce. Shi et al. [40], Uthus et al. [45], and Tanzer and Zhang [44] collected large-scale SLT data from YouTube and improved data quality via manual filtering; Albanie et al. [1] developed a British Sign Language translation corpus based on BBC broadcasts instead. Tanzer [43] scaled up ASL data by eschewing manual filtering and tolerating misaligned or irrelevant data. We follow this direction and scale to noisy multilingual sign language data, MT data, and augmented parallel data.

Despite the aforementioned advancements, many studies still heavily depend on sign glosses. As a bridge between sign video and target text, sign glosses ease learning, but are expensive to annotate, not always available, nonstandardized, and cannot cope with sign language grammar in generality [12]. Recent research therefore turns to gloss-free SLT, which often underperforms gloss-based counterparts [25, 58, 49] and performs poorly in open-domain settings [38, 50, 28]. We substantially improve gloss-free SLT performance across benchmarks through scaling. In this regard, our work is closely related to SSVP-SLT [37] but with a different focus. SSVP-SLT improves SLT by pretraining a neural sign encoder through large-scale self-supervised learning. By contrast, we adopt static landmarks to represent sign frames and improve the translation by transferring knowledge from other languages and tasks. The methods used in our study are orthogonal to SSVP-SLT. In addition, our work also falls into the category of improving multilingual SLT [54, 20]. We did not evaluate our models on these multilingual benchmarks, however, as they were either unavailable at the time of writing or unusable due to licensing issues.

6 Conclusion, Limitations, and Future Work

We presented a systematic study of data, model and language scaling for SLT via large-scale SLT pretraining. In general, scaling substantially improves SLT. We observe positive knowledge transfer from other sign language data and from machine translation data. By joint SLT and MT training, we show the feasibility of achieving zero-shot SLT. Data augmentation expanding SLT data to more spoken languages via off-the-shelf MT models significantly improves multilingual SLT. Putting everything together, finetuning our pretrained SLT models leads to new state-of-the-art SLT results across 5 benchmarks covering 5 sign languages (but still far from usable quality).

Although our models have nominally been pretrained on a massive number of sign languages (up to 80), we lack comprehensive and reliable multilingual benchmarks to fully understand their abilities and limitations. In addition, our models are limited to encoder-decoder based (m/By)T5 models, and SLT pretraining requires many computational resources, increasing the difficulty of reproduction.

In the future, we expect that continuing to scale sign language data, number of sign languages, vision pretraining/multimodality, etc. will reap further gains. As suggested by [42], it will be important to evaluate these growing capabilities on multilingual, standardized, open-domain SLT benchmarks.

Ethics Statement

We preprocess all sign videos with simplified landmarks as a form of anonymization and privacy protection. While the pretraining SLT data is larger scale than prior work, it may still suffer from demographic biases. Even if demographics were represented in proportion to the real world, and even with simplified landmarks, the resulting SLT models may not perform equally across groups and should be evaluated for fairness before real-world deployment. Our study mainly aims to understand the impact of scaling on SLT, and while we significantly improve translation quality, it is still far from usable for real-world applications. For many such applications, the other half of sign language translation—sign language generation—is also essential, whereas we focus only on sign language understanding in this work. Advancing both of these is critical to ensure that Deaf/Hard of Hearing signers get equal access to technology and the information that comes through it.

Acknowledgements

We thank Ankush Garg for valuable feedback on this work, Chris Dyer for constructive comments that greatly improved the quality of this paper, and Sam Sepah and the Google Translate team for supporting this research. We also thank the T5X team [36] for infrastructure support.

References

Albanie et al. [2021] Samuel Albanie, Gül Varol, Liliane Momeni, Hannah Bull, Triantafyllos Afouras, Himel Chowdhury, Neil Fox, Bencie Woll, Rob Cooper, Andrew McParland, et al. Bbc-oxford british sign language dataset. arXiv preprint arXiv:2111.03635, 2021.
Arivazhagan et al. [2019] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. Massively multilingual neural machine translation in the wild: Findings and challenges, 2019.
Bapna et al. [2022] Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, et al. Building machine translation systems for the next thousand languages. arXiv preprint arXiv:2205.03983, 2022.
Camgoz et al. [2018] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7784–7793, 2018.
Camgoz et al. [2020] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10023–10033, 2020.
Caswell et al. [2019] Isaac Caswell, Ciprian Chelba, and David Grangier. Tagged back-translation. In Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Marco Turchi, and Karin Verspoor, editors, Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 53–63, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-5206. URL https://aclanthology.org/W19-5206.
Chen et al. [2022a] Yutong Chen, Fangyun Wei, Xiao Sun, Zhirong Wu, and Stephen Lin. A simple multi-modality transfer learning baseline for sign language translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5120–5130, 2022a.
Chen et al. [2022b] Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, and Brian Mak. Two-stream network for sign language recognition and translation. Advances in Neural Information Processing Systems, 35:17043–17056, 2022b.
Conneau et al. [2019] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
Costa-jussà et al. [2022] Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022.
Dehghani et al. [2023] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
Desai et al. [2024] Aashaka Desai, Maartje De Meulder, Julie A. Hochgesang, Annemarie Kocab, and Alex X. Lu. Systemic biases in sign language ai research: A deaf-led call to reevaluate research agendas, 2024.
Dinh [2021] Tu Anh Dinh. Zero-shot speech translation. arXiv preprint arXiv:2107.06010, 2021.
Duarte et al. [2021] Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro-i Nieto. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Fan et al. [2021] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, et al. Beyond english-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48, 2021.
Freitag and Firat [2020] Markus Freitag and Orhan Firat. Complete multilingual neural machine translation. arXiv preprint arXiv:2010.10239, 2020.
Freitag et al. [2022] Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, and Marcos Zampieri, editors, Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.2.
Gemini et al. [2023] Team Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Gemini et al. [2024] Team Gemini, Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
Gueuwou et al. [2023] Shester Gueuwou, Sophie Siake, Colin Leong, and Mathias Müller. JWSign: A highly multilingual corpus of Bible translations for more diversity in sign language processing. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9907–9927, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.664. URL https://aclanthology.org/2023.findings-emnlp.664.
Gunasekar et al. [2023] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
Kocmi et al. [2021] Tom Kocmi, Christian Federmann, Roman Grundkiewicz, Marcin Junczys-Dowmunt, Hitokazu Matsushita, and Arul Menezes. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. In Loic Barrault, Ondrej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Tom Kocmi, Andre Martins, Makoto Morishita, and Christof Monz, editors, Proceedings of the Sixth Conference on Machine Translation, pages 478–494, Online, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.wmt-1.57.
Kudugunta et al. [2023] Sneha Kudugunta, Isaac Rayburn Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. MADLAD-400: A multilingual and document-level large audited dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=Y45ZCxslFx.
Lee et al. [2022] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.577. URL https://aclanthology.org/2022.acl-long.577.
Lin et al. [2023] Kezhou Lin, Xiaohan Wang, Linchao Zhu, Ke Sun, Bang Zhang, and Yi Yang. Gloss-free end-to-end sign language translation. arXiv preprint arXiv:2305.12876, 2023.
Lugaresi et al. [2019] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.
Ma et al. [2019] Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. Results of the wmt19 metrics shared task: Segment-level and strong mt systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy, August 2019. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W19-5302.
Müller et al. [2023] Mathias Müller, Malihe Alikhani, Eleftherios Avramidis, Richard Bowden, Annelies Braffort, Necati Cihan Camgöz, Sarah Ebling, Cristina España-Bonet, Anne Göhring, Roman Grundkiewicz, Mert Inan, Zifan Jiang, Oscar Koller, Amit Moryossef, Annette Rios, Dimitar Shterionov, Sandra Sidler-Miserez, Katja Tissi, and Davy Van Landuyt. Findings of the second WMT shared task on sign language translation (WMT-SLT23). In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 68–94, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.4. URL https://aclanthology.org/2023.wmt-1.4.
Nguyen et al. [2022] Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, and Ludwig Schmidt. Quality not quantity: On the interaction between dataset design and robustness of clip. Advances in Neural Information Processing Systems, 35:21455–21469, 2022.
OpenAI et al. [2023] Team OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://doi.org/10.3115/1073083.1073135.
Popović [2015] Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Ondřej Bojar, Rajan Chatterjee, Christian Federmann, Barry Haddow, Chris Hokamp, Matthias Huck, Varvara Logacheva, and Pavel Pecina, editors, Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/W15-3049. URL https://aclanthology.org/W15-3049.
Pratap et al. [2023] Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages. arXiv preprint arXiv:2305.13516, 2023.
Pu et al. [2021] Amy Pu, Hyung Won Chung, Ankur P Parikh, Sebastian Gehrmann, and Thibault Sellam. Learning compact metrics for mt. In Proceedings of EMNLP, 2021.
Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Roberts et al. [2023] Adam Roberts, Hyung Won Chung, Gaurav Mishra, Anselm Levskaya, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, et al. Scaling up models and data with t5x and seqio. Journal of Machine Learning Research, 24(377):1–8, 2023.
Rust et al. [2024] Phillip Rust, Bowen Shi, Skyler Wang, Necati Cihan Camgöz, and Jean Maillard. Towards privacy-aware sign language translation at scale. arXiv preprint arXiv:2402.09611, 2024.
Sandoval-Castaneda et al. [2023] Marcelo Sandoval-Castaneda, Yanhong Li, Bowen Shi, Diane Brentari, Karen Livescu, and Gregory Shakhnarovich. TTIC’s submission to WMT-SLT 23. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 344–350, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.35. URL https://aclanthology.org/2023.wmt-1.35.
Shazeer and Stern [2018] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
Shi et al. [2022] Bowen Shi, Diane Brentari, Gregory Shakhnarovich, and Karen Livescu. Open-domain sign language translation learned from online video. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6365–6379, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.427. URL https://aclanthology.org/2022.emnlp-main.427.
Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf.
Tanzer [2024a] Garrett Tanzer. Fleurs-asl: Including american sign language in massively multilingual multitask evaluation (coming soon). arXiv, 2024a.
Tanzer [2024b] Garrett Tanzer. Fingerspelling within sign language translation (coming soon). arXiv, 2024b.
Tanzer and Zhang [2024] Garrett Tanzer and Biao Zhang. Youtube-sl-25: A large-scale, open-domain multilingual sign language parallel corpus. arXiv, 2024.
Uthus et al. [2024] Dave Uthus, Garrett Tanzer, and Manfred Georg. Youtube-asl: A large-scale, open-domain american sign language-english parallel corpus. Advances in Neural Information Processing Systems, 36, 2024.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Voskou et al. [2023] Andreas Voskou, Konstantinos P Panousis, Harris Partaourides, Kyriakos Tolias, and Sotirios Chatzis. A new dataset for end-to-end sign language translation: The greek elementary school dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1966–1975, 2023.
Wang et al. [2019] Zirui Wang, Zihang Dai, Barnabas Poczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
Wong et al. [2024] Ryan Wong, Necati Cihan Camgoz, and Richard Bowden. Sign2GPT: Leveraging large language models for gloss-free sign language translation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LqaEEs3UxU.
Xu et al. [2023] Baixuan Xu, Haochen Shi, Tianshi Zheng, Qing Zong, Weiqi Wang, Zhaowei Wang, and Yangqiu Song. KnowComp submission for WMT23 sign language translation task. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 351–358, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.36. URL https://aclanthology.org/2023.wmt-1.36.
Xue et al. [2020] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.
Xue et al. [2022] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. doi: 10.1162/tacl_a_00461. URL https://aclanthology.org/2022.tacl-1.17.
Ye et al. [2023] Jinhui Ye, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, and Hui Xiong. Cross-modality data augmentation for end-to-end sign language translation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13558–13571, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.904. URL https://aclanthology.org/2023.findings-emnlp.904.
Yin et al. [2022] Aoxiong Yin, Zhou Zhao, Weike Jin, Meng Zhang, Xingshan Zeng, and Xiaofei He. Mlslt: Towards multilingual sign language translation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5099–5109, 2022. doi: 10.1109/CVPR52688.2022.00505.
Zhang and Sennrich [2021] Biao Zhang and Rico Sennrich. Edinburgh’s end-to-end multilingual speech translation system for IWSLT 2021. In Marcello Federico, Alex Waibel, Marta R. Costa-jussà, Jan Niehues, Sebastian Stuker, and Elizabeth Salesky, editors, Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 160–168, Bangkok, Thailand (online), August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.iwslt-1.19. URL https://aclanthology.org/2021.iwslt-1.19.
Zhang et al. [2020] Biao Zhang, Philip Williams, Ivan Titov, and Rico Sennrich. Improving massively multilingual neural machine translation and zero-shot translation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1628–1639, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.148. URL https://aclanthology.org/2020.acl-main.148.
Zhang et al. [2023] Biao Zhang, Mathias Müller, and Rico Sennrich. SLTUNET: A simple unified model for sign language translation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=EBS4C77p_5S.
Zhou et al. [2023] Benjia Zhou, Zhigang Chen, Albert Clapés, Jun Wan, Yanyan Liang, Sergio Escalera, Zhen Lei, and Du Zhang. Gloss-free sign language translation: Improving from visual-language pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20871–20881, 2023.
Zhou et al. [2021a] Hao Zhou, Wengang Zhou, Weizhen Qi, Junfu Pu, and Houqiang Li. Improving sign language translation with monolingual data by sign back-translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1316–1325, 2021a.
Zhou et al. [2021b] Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li. Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Transactions on Multimedia, 24:768–779, 2021b.

Appendix A Appendix

Table 7: Data statistics for YouTube sign language and MADLAD spoken language. We list ISO 639-3 code, language name, and the number of hours/clips/videos for sign language; for spoken language, we list BCP-47 code, language name and the number of parallel examples. “K”: thousand, “M”: million. Note that, as in Tanzer and Zhang [44] (pre-filtering) and Tanzer [43], these language labels are heuristically estimated from public video metadata, such as caption language and text in the video title, description, etc.

Sign Language (SL)					Spoken Language
ISO 639-3	Name	# Hours	# Clips	# Videos	BCP-47	Name	# Examples
ase	American SL	2.8K	285.2K	25.9K	es	Spanish	292.8M
bzs	Brazilian SL	590.4	60.2K	5.9K	de	German	283.3M
pso	Polish SL	421.8	41.8K	3.2K	fr	French	243.6M
ins	Indian SL	375.3	39.3K	5.7K	it	Italian	100.1M
bfi	British SL	267.3	27.4K	2.7K	cs	Czech	53.1M
gsg	German SL	235.2	24.2K	2.1K	pl	Polish	42.9M
fsl	French SL	193.7	19.9K	2.9K	ru	Russian	29.0M
jsl	Japanese SL	176.5	17.9K	3.0K	zh	Simplified Chinese	25.9M
ise	Italian SL	161.0	16.8K	2.4K	ar	Arabic	18.2M
asf	Australian SL	123.1	12.5K	1.5K	ja	Japanese	5.3M
rsl	Russian SL	119.4	12.1K	1.2K	hi	Hindi	1.2M
csc	Catalan SL	114.8	11.8K	1.7K	nl	Dutch	93.1M
csn	Colombian SL	107.8	11.1K	478.0	pt	Portuguese	83.7M
aed	Argentine SL	86.1	8.8K	522.0	sv	Swedish	51.9M
mfs	Mexican SL	76.9	7.6K	434.0	hu	Hungarian	40.0M
kvk	Korean SL	69.3	7.1K	612.0	da	Danish	38.2M
hsh	Hungarian SL	55.9	5.8K	1.3K	fi	Finnish	34.1M
sgg	Swiss German SL	48.7	5.1K	634.0	el	Greek	25.2M
prl	Peruvian SL	42.3	4.2K	147.0	sk	Slovak	25.0M
fse	Finnish SL	41.2	4.3K	593.0	no	Norwegian	19.4M
swl	Swedish SL	38.7	4.0K	563.0	bg	Bulgarian	15.5M
asq	Austrian SL	31.8	3.2K	592.0	lt	Lithuanian	15.3M
tsm	Turkish SL	31.8	3.3K	446.0	lv	Latvian	14.3M
dse	Dutch SL	31.4	3.2K	352.0	sl	Slovenian	11.8M
cse	Czech SL	29.2	3.0K	336.0	et	Estonian	11.0M
inl	Indonesian SL	27.5	2.8K	236.0	ko	Korean	5.8M
nsl	Norwegian SL	22.0	2.2K	215.0	hr	Croatian	5.3M
hks	Hong Kong SL	21.3	2.2K	248.0	is	Icelandic	4.1M
tss	Taiwan SL	19.6	2.0K	258.0	sr	Serbian	2.5M
gss	Greek SL	18.2	1.9K	169.0	tr	Turkish	2.5M
dsl	Danish SL	16.0	1.6K	188.0	vi	Vietnamese	1.5M
csg	Chilean SL	15.2	1.5K	184.0	id	Indonesian	1.4M
sfb	French Belgian SL	14.4	1.5K	358.0	he	Hebrew	1.1M
isr	Israeli SL	14.3	1.4K	289.0	th	Thai	1.1M
vietnam	Vietnamese SL	14.2	1.1K	97.0	ms	Malay	907.5K
isg	Irish SL	13.2	1.4K	101.0	uk	Ukrainian	881.0K
slovenia	Slovenian SL	12.7	1.3K	177.0	ca	Catalan	686.2K
nzs	New Zealand SL	12.3	1.3K	224.0	ta	Tamil	396.9K
icl	Icelandic SL	11.1	1.2K	213.0	fil	Filipino	369.8K
sls	Singapore SL	9.9	1.0K	148.0	ne	Nepali	277.9K
tsq	Thai SL	9.0	923.0	150.0	cy	Welsh	93.3K
pks	Pakistani SL	8.6	902.0	145.0
svk	Slovak SL	8.5	886.0	114.0
jos	Jordanian SL	8.3	880.0	124.0
lls	Lithuanian SL	7.6	792.0	166.0
csr	Costa Rican SL	7.6	789.0	45.0
psr	Portuguese SL	7.4	764.0	127.0
rms	Romanian SL	7.3	747.0	60.0
xml	Malaysian SL	7.2	668.0	85.0
ecs	Ecuadorian SL	7.2	727.0	28.0
psp	Filipino SL	6.8	715.0	85.0
sfs	South African SL	5.2	543.0	41.0
ugy	Uruguay SL	4.7	483.0	113.0
esn	Salvadoran SL	3.9	403.0	15.0
xki	Kenyan SL	3.8	384.0	23.0
serbia	Serbian SL	3.3	351.0	36.0
csq	Croatian SL	3.1	326.0	31.0
esl	Egyptian SL	3.0	262.0	32.0
psl	Puerto Rican SL	1.8	181.0	17.0
bengladesh	Bengali SL	1.6	165.0	15.0
gsm	Guatemalan SL	1.4	144.0	26.0
xms	Moroccan SL	1.2	125.0	4.0
lsp	Panamanian SL	1.2	123.0	8.0
fcs	Quebec SL	0.9	91.0	27.0
eso	Estonian SL	0.9	90.0	13.0
emirati	UAE SL	0.8	81.0	14.0
vsl	Venezuelan SL	0.7	69.0	15.0
pys	Paraguayan SL	0.7	69.0	5.0
kazakh	Kazakh SL	0.6	69.0	6.0
hds	Honduran SL	0.6	67.0	8.0
macau	Macau SL	0.6	109.0	109.0
sdl	Saudi Arabian SL	0.5	49.0	7.0
doq	Dominican SL	0.5	45.0	10.0
belarus	Belarusian SL	0.4	29.0	5.0
bqn	Bulgarian SL	0.3	31.0	9.0
sqs	Sri Lankan SL	0.3	29.0	8.0
lsl	Latvian SL	0.2	27.0	3.0
bvl	Bolivian SL	0.2	26.0	5.0
nsi	Nigerian SL	0.1	12.0	2.0
nsp	Nepali SL	0.1	6.0	1.0

A.1 Setup

Sign Video Preprocessing

Our landmark preprocessing is identical to [45], and we use the same random 34-second video clipping as [42]. We preprocess sign language video with its default frame rate but discard every second frame for computational efficiency. We convert each video frame to a 255-dimensional normalized vector using MediaPipe Holistic landmarks [26], which also facilitates video anonymization. The input video is eventually transformed into a vector sequence and then mapped to the encoder via a linear projection layer.
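The steps in this paragraph can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the clip start is assumed to be drawn uniformly, the even-indexed frames are assumed to be the ones kept, and random vectors stand in for the MediaPipe Holistic landmarks. Only the 34-second clip length, the frame dropping, the 255-dimensional frame vector, and the linear projection come from the text.

```python
import random

LANDMARK_DIM = 255   # per-frame feature size stated in the text
CLIP_SECONDS = 34    # clip length stated in the text


def random_clip_bounds(num_frames, fps, rng, clip_seconds=CLIP_SECONDS):
    """Pick a random window of at most `clip_seconds` seconds.

    Assumption: the start frame is drawn uniformly; the paper defers the
    exact clipping procedure to prior work.
    """
    clip_len = min(num_frames, int(round(clip_seconds * fps)))
    start = rng.randint(0, num_frames - clip_len)
    return start, start + clip_len


def drop_every_second_frame(frames):
    """Keep frames 0, 2, 4, ... (assumed parity) to halve the sequence length."""
    return frames[::2]


def linear_projection(frames, weight, bias):
    """Map each landmark vector to the encoder width: y = W x + b.

    `weight` holds one row of input coefficients per output dimension.
    """
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    fps, d_model = 30, 8  # toy values, not taken from the paper
    # Stand-in for landmark extraction: one 255-dim vector per frame.
    video = [[rng.uniform(-1.0, 1.0) for _ in range(LANDMARK_DIM)] for _ in range(1500)]
    start, end = random_clip_bounds(len(video), fps, rng)
    clip = drop_every_second_frame(video[start:end])
    weight = [[rng.gauss(0.0, 0.02) for _ in range(LANDMARK_DIM)] for _ in range(d_model)]
    bias = [0.0] * d_model
    encoder_inputs = linear_projection(clip, weight, bias)
    print(len(clip), len(encoder_inputs[0]))
```

Dropping frames before the projection halves the encoder sequence length, which is where the computational saving mentioned above would come from.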

Downstream Benchmarks

Note the official SLT track in WMT23 for LIS-CH and LSF-CH is for sign language generation rather than translation. We reversed its direction to obtain an SLT dataset. FLEURS-ASL#0 is the subset of FLEURS-ASL [42] recorded by signer #0, i.e. 353 sentences from FLORES [10] translated into ASL by a Certified Deaf Interpreter. We report only signer #0 because the rest of the benchmark was not complete when these experiments were run.

For these benchmarks, we only use the sign language video and target text without glosses. All SLT models in this study are gloss-free.

Model Setting

For Pretraining, we use a batch size of 256 and a constant learning rate of 0.001. We pretrain models up to 1M steps using 64/64/128 TPU-v3 chips for Base/Large/XL, taking 7-20 days. We select the best checkpoint for downstream application based on the How2Sign dev performance measured by BLEU. (We didn’t adopt BLEURT for model selection because it’s significantly more expensive and time-consuming than BLEU.)

For Finetuning, we use a batch size of 32 and a constant learning rate of 0.0005. By default, we perform finetuning on each downstream benchmark separately. We only consider the SLT task at finetuning, and directly finetune the model on well-aligned (sign video segment, target translation) pairs, which are provided in all downstream benchmarks. We tune models up to 50K steps using 16/32 TPU-v3 chips for Base/XL, taking 2-5 days. We select the best checkpoint for final evaluation based on the dev-set BLEU.
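The two training stages and the checkpoint selection rule can be summarized as a small configuration sketch. The class, field, and function names below are hypothetical and do not correspond to T5X options; only the numeric values are taken from the text, and the tie-breaking rule (earliest step wins) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StageConfig:
    batch_size: int
    learning_rate: float   # constant schedule in both stages
    max_steps: int
    tpu_v3_chips: dict     # model size -> number of chips


PRETRAIN = StageConfig(256, 1e-3, 1_000_000, {"Base": 64, "Large": 64, "XL": 128})
FINETUNE = StageConfig(32, 5e-4, 50_000, {"Base": 16, "XL": 32})


def select_best_checkpoint(dev_bleu_by_step):
    """Return the step with the highest dev-set BLEU (earliest step on ties)."""
    return max(sorted(dev_bleu_by_step), key=lambda step: dev_bleu_by_step[step])


if __name__ == "__main__":
    # Made-up dev BLEU scores, for illustration only.
    print(select_best_checkpoint({10_000: 1.2, 20_000: 2.5, 30_000: 2.1}))
```

Selecting on BLEU rather than BLEURT keeps the selection loop cheap, at the cost of choosing checkpoints with a metric other than the one emphasized in the final comparison.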

A.2 More Results and Analysis

ID	Model	H2S	E23	WMT23 LIS-CH	WMT23 LSF-CH	WMT23 SRF	WMT23 SS	Avg
0	Previous SOTA	18.10	5.69	5.20	7.00	0.30	7.50	7.30
1	ByT5 Base	3.71	7.68	0.40	0.79	1.01	3.10	2.78
2	1 + Baseline + YT-ASL	17.94	16.58	6.59	8.65	2.14	8.69	10.10
3	2 + MT-Small ( $p_{mt}=0.9$ )	17.30	17.90	8.84	11.86	2.04	11.01	11.49
4	3 + Aug-YT-ASL-Small	18.55	22.09	9.81	12.91	2.90	14.34	13.43
5	4 + Aug-YT-ASL&MT-Large + ByT5 XL	19.31	24.08	9.70	11.71	2.76	13.26	13.47
6	2 + YT-Full	19.79	21.81	12.99	15.32	2.07	13.71	14.28
7	6 + Aug-YT-ASL&MT-Small	18.98	23.60	13.63	15.75	2.85	14.44	14.88
8	7 + Aug-YT-ASL&MT-Large + ByT5 XL	21.06	25.65	14.93	18.77	2.80	18.17	16.90
9	8 + Multilingual SLT Tuning	19.25	23.05	16.79	17.22	2.91	15.92	15.86

(a) BLEU $\uparrow$ scores.

ID	Model	H2S	E23	WMT23 LIS-CH	WMT23 LSF-CH	WMT23 SRF	WMT23 SS	Avg
0	Previous SOTA	-	-	-	-	17.50	-	-
1	ByT5 Base	19.55	25.19	14.81	15.56	11.08	21.75	17.99
2	1 + Baseline + YT-ASL	38.78	41.34	27.57	29.99	17.06	38.31	31.18
3	2 + MT-Small ( $p_{mt}=0.9$ )	39.01	45.40	30.56	33.67	16.13	40.38	34.19
4	3 + Aug-YT-ASL-Small	39.64	47.92	31.40	34.95	17.49	43.56	35.83
5	4 + Aug-YT-ASL&MT-Large + ByT5 XL	40.15	48.70	29.44	33.20	17.96	40.21	34.94
6	2 + YT-Full	41.13	47.71	38.23	38.13	16.78	42.33	37.38
7	6 + Aug-YT-ASL&MT-Small	40.35	49.57	38.07	39.88	19.34	42.83	38.34
8	7 + Aug-YT-ASL&MT-Large + ByT5 XL	41.97	50.09	38.90	40.43	19.58	45.94	39.49
9	8 + Multilingual SLT Tuning	40.39	48.49	40.28	38.63	18.72	45.22	38.62

(b) ChrF $\uparrow$ scores.

ID	Model	H2S	E23	WMT23 LIS-CH	WMT23 LSF-CH	WMT23 SRF	WMT23 SS	Avg
0	Previous SOTA	50.80	-	25.20	18.80	24.60	37.70	-
1	ByT5 Base	34.00	22.14	22.77	7.74	15.41	26.88	21.49
2	1 + Baseline + YT-ASL	51.74	37.79	24.24	15.43	21.82	35.59	31.10
3	2 + MT-Small ( $p_{mt}=0.9$ )	52.62	45.98	33.10	24.58	23.33	45.45	37.51
4	3 + Aug-YT-ASL-Small	53.36	49.34	38.61	28.70	25.87	49.61	40.91
5	4 + Aug-YT-ASL&MT-Large + ByT5 XL	54.28	54.16	38.93	27.29	28.42	51.73	42.47
6	2 + YT-Full	53.51	49.48	42.11	31.16	21.15	44.28	40.28
7	6 + Aug-YT-ASL&MT-Small	53.70	53.13	45.09	37.69	30.31	52.45	45.40
8	7 + Aug-YT-ASL&MT-Large + ByT5 XL	55.69	56.94	51.94	41.14	33.94	57.96	49.60
9	8 + Multilingual SLT Tuning	53.47	55.57	54.54	39.26	29.33	58.08	48.38

(c) BLEURT $\uparrow$ scores.

Table 8: Finetuning performance on downstream SLT benchmarks. “H2S/E23”: How2Sign/Elementary23. “SRF/SS”: WMT23 DSGS SRF/SS test split. “Avg”: averaged performance over all benchmarks. MT data are added in both translation directions. Previous SOTA: How2Sign [43]; Elementary23 [47]; WMT23 SRF [28]; WMT23 LIS-CH, LSF-CH, and SS [44]. Scaling SLT reaches new SOTA across benchmarks. All models are finetuned on each SLT benchmark separately, except model (9).
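The “Avg” column appears to be an unweighted mean over the six test sets. This is an assumption inferred from the caption and from rows that reproduce under it, not a documented formula. A minimal sketch that recomputes it for two rows of Table 8 (the helper name `macro_average` is illustrative):

```python
from statistics import mean

# Column order used in Table 8.
BENCHMARKS = ["H2S", "E23", "LIS-CH", "LSF-CH", "SRF", "SS"]

# Per-benchmark scores copied from Table 8.
bleu_model_2 = [17.94, 16.58, 6.59, 8.65, 2.14, 8.69]          # (a) BLEU, model (2)
bleurt_model_8 = [55.69, 56.94, 51.94, 41.14, 33.94, 57.96]    # (c) BLEURT, model (8)


def macro_average(scores):
    """Unweighted mean over the test sets, rounded to two decimals.

    Assumption: every benchmark contributes equally, regardless of its size.
    """
    if len(scores) != len(BENCHMARKS):
        raise ValueError("expected one score per benchmark")
    return round(mean(scores), 2)


print(macro_average(bleu_model_2))
print(macro_average(bleurt_model_8))
```

Under this assumption the two calls give 10.1 and 49.6, matching the reported 10.10 and 49.60. Because every test set carries equal weight, low-scoring benchmarks such as SRF pull the average down as much as larger benchmarks lift it.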

ID	Model	H2S	E23	WMT23 LIS-CH	WMT23 LSF-CH	WMT23 SRF	WMT23 SS	Avg
1	ByT5 Base	-	-	-	-	-	-	-
2	1 + Baseline + YT-ASL	3.77	0.06	0.15	0.35	0.15	0.15	0.77
3	2 + MT-Small ( $p_{mt}=0.9$ )	4.75	0.02	0.07	0.25	0.06	0.12	0.88
4	3 + Aug-YT-ASL-Small	3.31	0.01	0.12	0.43	0.12	0.31	0.72
5	4 + Aug-YT-ASL&MT-Large + ByT5 XL	2.81	0.21	0.22	0.31	0.06	0.24	0.64
6	2 + YT-Full	5.78	0.33	3.43	5.69	1.08	3.88	3.37
7	6 + Aug-YT-ASL&MT-Small	4.10	0.05	1.67	2.65	0.50	2.21	1.86
8	7 + Aug-YT-ASL&MT-Large + ByT5 XL	4.05	2.45	4.50	3.73	0.64	3.45	3.14

(a) BLEU $\uparrow$ scores.

ID	Model	H2S	E23	WMT23 LIS-CH	WMT23 LSF-CH	WMT23 SRF	WMT23 SS	Avg
1	ByT5 Base	-	-	-	-	-	-	-
2	1 + Baseline + YT-ASL	20.65	6.62	12.67	15.10	13.89	13.95	13.81
3	2 + MT-Small ( $p_{mt}=0.9$ )	19.55	0.05	10.90	11.33	10.75	12.89	10.91
4	3 + Aug-YT-ASL-Small	15.21	0.05	9.67	9.87	5.89	14.28	9.16
5	4 + Aug-YT-ASL&MT-Large + ByT5 XL	11.40	9.25	8.75	6.88	3.35	7.55	7.86
6	2 + YT-Full	23.44	14.81	25.34	26.78	16.42	28.36	22.53
7	6 + Aug-YT-ASL&MT-Small	18.18	10.22	19.81	22.64	10.25	25.95	17.84
8	7 + Aug-YT-ASL&MT-Large + ByT5 XL	13.47	22.34	25.53	26.16	14.82	26.96	21.55

(b) ChrF $\uparrow$ scores.

ID	Model	H2S	E23	WMT23 LIS-CH	WMT23 LSF-CH	WMT23 SRF	WMT23 SS	Avg
1	ByT5 Base	-	-	-	-	-	-	-
2	1 + Baseline + YT-ASL	30.36	9.13	9.32	6.33	9.69	10.45	12.55
3	2 + MT-Small ( $p_{mt}=0.9$ )	34.24	1.07	16.38	10.44	13.14	14.81	15.01
4	3 + Aug-YT-ASL-Small	25.20	1.63	23.41	10.82	10.65	22.06	15.61
5	4 + Aug-YT-ASL&MT-Large + ByT5 XL	23.87	10.35	21.58	6.75	7.04	15.96	14.26
6	2 + YT-Full	37.13	15.08	28.92	18.33	17.87	29.66	24.50
7	6 + Aug-YT-ASL&MT-Small	25.25	12.68	33.54	24.48	10.66	33.91	23.42
8	7 + Aug-YT-ASL&MT-Large + ByT5 XL	22.41	34.14	43.07	34.26	21.52	39.47	32.48

(c) BLEURT $\uparrow$ scores.

Table 9: Pretraining performance on downstream SLT benchmarks.

Different evaluation metrics may disagree.

Which metric to use for translation evaluation is actively debated in the MT community [22]. While BLEU has been widely adopted, it often correlates poorly with human evaluation, particularly when the translation models are strong [27]; neural metrics are recommended instead [17]. We follow this trend and adopt BLEURT as the main metric. For compatibility with past studies and to ease future comparison, we also report BLEU and ChrF. Table 8 shows some disagreements between BLEURT and BLEU/ChrF. For example, on average, model (4) outperforms model (5) on ChrF (35.83 vs. 34.94) and is comparable to it on BLEU (13.43 vs. 13.47), while BLEURT shows a clear advantage for model (5) over model (4) (42.47 vs. 40.91). These disagreements mean that evaluation metrics should be selected with care. In this study, we rely more on BLEURT for the analysis as it correlates better with human evaluation [17].
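One way to make such disagreements concrete is to list the model pairs that two metrics rank in opposite order. The sketch below does this on the “Avg” column of Table 8 for models (2) to (9). The helper `discordant_pairs` is illustrative and not part of the paper's evaluation pipeline, and comparing averages ignores per-benchmark differences.

```python
from itertools import combinations

# "Avg" column of Table 8 for models (2)-(9), keyed by model ID.
AVG = {
    "BLEU": {2: 10.10, 3: 11.49, 4: 13.43, 5: 13.47,
             6: 14.28, 7: 14.88, 8: 16.90, 9: 15.86},
    "ChrF": {2: 31.18, 3: 34.19, 4: 35.83, 5: 34.94,
             6: 37.38, 7: 38.34, 8: 39.49, 9: 38.62},
    "BLEURT": {2: 31.10, 3: 37.51, 4: 40.91, 5: 42.47,
               6: 40.28, 7: 45.40, 8: 49.60, 9: 48.38},
}


def discordant_pairs(metric_a, metric_b):
    """Return model pairs (i, j) that the two metrics order differently."""
    a, b = AVG[metric_a], AVG[metric_b]
    pairs = []
    for i, j in combinations(sorted(a), 2):
        # Opposite signs of the score differences mean opposite rankings.
        if (a[i] - a[j]) * (b[i] - b[j]) < 0:
            pairs.append((i, j))
    return pairs


for m1, m2 in combinations(AVG, 2):
    print(f"{m1} vs {m2}: {discordant_pairs(m1, m2)}")
```

On these averages, BLEU and ChrF disagree only on the pair (4, 5), while BLEURT additionally ranks model (6) below models (4) and (5), contrary to both surface metrics.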