Towards Zero-Shot Multimodal Machine Translation

Matthieu Futeral¹,² Cordelia Schmid¹,² Benoît Sagot¹ Rachel Bawden¹
¹Inria Paris
²Département d’informatique de l’ENS, CNRS, PSL Research University
[email protected]

Abstract

Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e. models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the outputs of the original MT model and the new MMT model. We evaluate on standard MMT benchmarks and the recently released CoMMuTE, a contrastive benchmark aiming to evaluate how well models use images to disambiguate English sentences. We obtain disambiguation performance close to state-of-the-art MMT models trained additionally on fully supervised examples. To show that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capabilities and translation fidelity at inference time using classifier-free guidance and without any additional data. Our code, data and trained models are publicly accessible (https://github.com/MatthieuFP/CoMMuTE and https://github.com/MatthieuFP/zerommt).

1 Introduction

Multimodal machine translation (MMT) refers to the use of additional modalities, such as images or videos, in machine translation (MT) systems. The main purpose is to provide an additional signal in the case of ambiguity in the text to be translated. Most current MMT models are trained solely on the Multi30K (M30K) dataset (Elliott et al., 2016, 2017; Barrault et al., 2018), a multilingual and multimodal corpus composed of 30,000 images, their captions in English and translations in French, German and Czech. There have been recent breakthroughs in MMT thanks to the use of pretrained text-only MT systems and monolingual captioning data in order to adapt MT systems to MMT (Futeral et al., 2023; Gupta et al., 2023; Vijayan et al., 2024). Good results have been shown using this strategy on CoMMuTE (Futeral et al., 2023), a contrastive benchmark designed to evaluate MMT models on their use of images to disambiguate between contrastive translations, and these results were significantly better than those of MMT systems trained on M30K only (Yin et al., 2020; Yao and Wan, 2020; Liu et al., 2021; Wu et al., 2021; Li et al., 2022b). However, these models still rely on the multilingual and multimodal M30K corpus during training to ensure good translation performance. This presents a fundamental limitation: collecting translations for captioning data is expensive (the authors of M30K stated that they spent €23,000 on the translation of the 30,000 English captions into German), and this restricts the extension of MMT to new languages. Attempts to bypass the problem have been made by carrying out zero-shot transfer between languages (Hirasawa et al., 2023), but this results in poor use of images to disambiguate ambiguous texts (i.e. the visual modality is not correctly exploited).

In this work, we address this limitation by proposing a method requiring only monolingual multimodal text data, removing the need for fully supervised data, i.e. data such as M30K that is both parallel and multimodal. To do so, we start from a strong pretrained MT system and use it to translate multimodal English data into the target languages of interest. We then adapt the pretrained MT system to images using two objectives: (1) visually conditioned masked language modelling (VMLM) (Li et al., 2019; Lu et al., 2019) on multimodal English data to force the model to use image information and (2) a KL penalty on the translated multimodal data to maintain translation capabilities. We test our method on six language directions: English to French, Czech, German, Arabic, Russian and Chinese, extending the CoMMuTE dataset to cover the three additional languages (https://github.com/MatthieuFP/CoMMuTE). Our method, called ZeroMMT, obtains CoMMuTE scores close to the supervised state of the art, while there is only a small drop in BLEU and COMET scores compared to the underlying text-only MT system on standard MMT benchmarks composed mainly of unambiguous examples. We further show that we can control the trade-off between disambiguation and global translation performance at inference time with classifier-free guidance.

2 Related Work

Training MMT systems

Research in MMT originally focused on which visual features to use (Li et al., 2022a) and how to integrate them into sequence-to-sequence models (Sutskever et al., 2014) while training them from scratch on the widely used M30K benchmark (Libovický et al., 2016; Calixto et al., 2016; Elliott and Kádár, 2017; Calixto and Liu, 2017; Yao and Wan, 2020; Yin et al., 2020; Lin et al., 2020; Liu et al., 2021; Li et al., 2022b). These MMT systems typically show improvements of around 1-2 BLEU points on standard MMT benchmarks in comparison to text-only baselines that were also trained from scratch, which is not significant enough to state that MMT systems are better than their text-only counterparts (Mathur et al., 2020). Wu et al. (2021) additionally observed that while they obtained +1 BLEU on average on M30K test sets with the use of images, they got the same improvements with randomly initialized visual features, most likely due to regularization, i.e. the images were in reality not being exploited effectively. On top of that, being trained from scratch on fully supervised MMT data only, these models lag far behind state-of-the-art MT systems (Liu et al., 2020; Costa-jussà et al., 2022) trained on large amounts of parallel text data.

Futeral et al. (2023) show that M30K contains few ambiguous examples that require visual context, and that MMT models can obtain decent translation performance on the benchmark while struggling to exploit images correctly. To handle this, they introduce VGAMT, an adapted MMT model building on top of a frozen state-of-the-art MT model. They also show that visually conditioned masked language modelling (VMLM) on English captioning data is a key additional objective to force MMT systems to become truly multimodal. Sato et al. (2023) and Bowen et al. (2024) further show that choosing the masked tokens in a targeted way rather than randomly slightly boosts results. However, all these methods still require the use of fully supervised data to get good translation quality.

There have been efforts to train MMT models in an unsupervised manner, without using fully supervised data (Su et al., 2019; Huang et al., 2020; Fei et al., 2023). These approaches are however fundamentally different from this work, as their goal is to obtain MT models using synthetic text-only parallel data through the use of visual pivoting, not to target disambiguation capabilities. Hirasawa et al. (2023) proposed a zero-shot method to learn MMT by training on the little fully supervised data available and relying on zero-shot cross-lingual transfer. As the amount of fully supervised data for a single language is small ($\leq$ 30K text-image pairs), and few languages are covered ($\leq$ 8), this method results in poor exploitation of the image to learn disambiguation capabilities.

Figure 1: Overview of our approach. We train on two objectives: visually conditioned masked language modelling (VMLM) and Kullback-Leibler (KL) divergence. All weights are frozen during training except the visual projector and the adapters in the MT model.

Evaluating MMT systems

The main test sets to evaluate MMT systems are the test subsets of Multi30K (Elliott et al., 2016, 2017; Barrault et al., 2018). However, some of the translations were produced without access to the images and they have also been found to contain only a few ambiguous examples (Futeral et al., 2023) where visual context is necessary. They are therefore not best adapted to evaluating MMT systems when used alone. Elliott (2018) and Caglayan et al. (2019) further proposed to use an adversarial evaluation method and a probing method based on masked text inputs to assess the utility of images in the translation process. While this is a good sanity check, it is not a good proxy for evaluating the capacity of an MMT model to disambiguate translations. Lala and Specia (2018), Li et al. (2021) and Zhu et al. (2023) released evaluation datasets composed of sentences in English with ambiguous words accompanied by disambiguating images. However, Li et al. (2021) only target gender ambiguity, and all these datasets are prone to distributional bias, which is difficult to measure and which allows text-only MT systems to perform very well on them. Traditional MT metrics (Papineni et al., 2002; Banerjee and Lavie, 2005; Rei et al., 2020) are also unable to capture how well MMT systems use images, as they do not specifically target translations where images would be required to translate correctly. Tackling these issues, Futeral et al. (2023) introduced CoMMuTE, a contrastive evaluation dataset composed of English sentences to be translated, built around ambiguous words, each accompanied by two translations with two images, each of which disambiguates the English sentence. They use perplexity to evaluate MMT models. Because each sentence appears with both images and a different translation is correct for each, text-only MT systems, which do not have access to the images, can only perform as well as random (50%). The benchmark is therefore a direct proxy of MMT models’ ability to use the image to translate and disambiguate.

3 Extending the CoMMuTE benchmark

	En	Ar	Ru	Zh
#unique sents.	1,155	1,310	1,310	1,310
#tokens	1,384	2,958	3,105	2,832
#unique toks.	1,559	1,870	1,002	1,762

Table 1: Statistics of the extension of CoMMuTE.

CoMMuTE (Futeral et al., 2023) is a contrastive test set designed to evaluate how well MMT systems use images to disambiguate ambiguous sentences and their translations. For each English sentence built around an ambiguous English word, two possible translations are available, each accompanied by one image. The associated image is chosen to disambiguate the English sentence such that one translation is correct and the other incorrect. CoMMuTE was originally available for English-to-{French,German,Czech}; we extend it to three new target languages: Arabic, Chinese and Russian (professional translators were asked to translate the English sentence for each of the two images). We also release a small validation set composed of 30 English ambiguous words (not overlapping with those from the test set) with two French translations, each associated with one image, to be used for model selection during training. Table 1 shows statistics of the extension of CoMMuTE; token counts were calculated with the NLLB tokenizer (Costa-jussà et al., 2022).

4 Our Approach

Our goal is to train an MMT model capable of effectively using images to disambiguate contrasting translations while maintaining its translation capabilities, without using any fully supervised data. To do so, and as shown in Figure 1, we start from a strong pretrained MT system (different versions of NLLB (Costa-jussà et al., 2022) depending on the targeted model size), and use it to translate English captions into the target languages. Similarly to Futeral et al. (2023), we then turn it into an MMT model by adding lightweight trainable modules (visual projectors and adapters), keeping the original weights frozen during training. We use SigLIP (Zhai et al., 2023) to provide visual embeddings that are concatenated to the sequences of text embeddings in the NLLB encoder. We then frame the problem of training an MMT model as two sub-objectives: (1) forcing the model to use images in the translation process through a visually conditioned masked language modelling (VMLM) objective, and (2) maintaining the performance of the original MT system without any fully supervised data, by minimising the Kullback-Leibler (KL) divergence between the output distributions of the MMT system and the original MT system on the previously automatically translated data. We call our approach ZeroMMT.

Concretely, let $x_{1,\dots,n}$ denote the sequence of tokens of the English sentence, $i$ the image embedding, $y_{1,\dots,m}$ the translated sequence of tokens, $f_{\theta}$ the original MT system and $f_{\theta,\beta}$ the MMT system built on top of the text-only MT model with additional lightweight modules $\beta$, both outputting probability distributions over tokens. We formally define the losses as follows:

$\displaystyle\mathscr{L}_{\textit{VMLM}}=\sum_{j}y_{j}\log\big(f_{\theta,\beta}(y_{j};y_{<j},x_{\setminus\mathscr{M}},i)\big)$ (1)

$\displaystyle\mathscr{L}_{\textit{KL}}=\sum_{j}f_{\theta}(y_{j};y_{<j},x)\log\frac{f_{\theta}(y_{j};y_{<j},x)}{f_{\theta,\beta}(y_{j};y_{<j},x,i)}$ (2)

where $\mathscr{M}$ is the set of masked input indices. The final loss is a weighted combination of (1) and (2), where the value of $\lambda$ is chosen based on results on validation sets, as described in Section 5.2:

$\displaystyle\mathscr{L}=\mathscr{L}_{\textit{VMLM}}+\lambda\mathscr{L}_{\textit{KL}}$
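As a rough illustration of how the two terms combine, the following pure-Python sketch computes both losses from toy per-position distributions. The function names and the toy numbers are hypothetical. The sketch assumes that the VMLM term is minimised as a negative log-likelihood of the target tokens and that the KL term is taken over the full vocabulary distribution at each position; this is one common reading of Equations (1) and (2), not a description of the released implementation.

```python
import math

def vmlm_loss(mmt_probs, target_ids):
    """Negative log-likelihood of the target tokens under the MMT model.

    mmt_probs: one probability distribution (list of floats) per position.
    target_ids: gold token index for each position.
    """
    return -sum(math.log(dist[t]) for dist, t in zip(mmt_probs, target_ids))

def kl_loss(mt_probs, mmt_probs):
    """Sum over positions of KL(text-only MT || MMT) over the vocabulary."""
    total = 0.0
    for p, q in zip(mt_probs, mmt_probs):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total

def zerommt_loss(vmlm_probs, vmlm_targets, mt_probs, mmt_probs, lam=0.1):
    """Weighted combination L = L_VMLM + lambda * L_KL."""
    return vmlm_loss(vmlm_probs, vmlm_targets) + lam * kl_loss(mt_probs, mmt_probs)

# Toy example: vocabulary of 3 tokens, 2 positions (numbers are made up).
vmlm_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
vmlm_targets = [0, 1]
mt_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
mmt_probs = [[0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]

loss = zerommt_loss(vmlm_probs, vmlm_targets, mt_probs, mmt_probs, lam=0.1)
print(round(loss, 4))
```

The KL term is zero when the adapted model reproduces the text-only distributions exactly, so a larger $\lambda$ pulls the MMT system back towards the original MT behaviour.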

5 Experiments

5.1 Data

We trained our models on the Conceptual Captions dataset (Sharma et al., 2018); at the time of writing, we were able to collect 2,831,746 out of the 3,300,000 images. We translated Conceptual Captions into French, German, Czech, Chinese, Russian and Arabic using the underlying NLLB (Costa-jussà et al., 2022) MT system (of size 600M, 1.3B or 3.3B depending on the experiment) using beam search (Graves, 2012) with a beam of size 4 for the 600M model and 2 for the larger ones.

We evaluated our models on the Multi30K test sets (Elliott et al., 2016, 2017; Barrault et al., 2018) for English-to-{German,French,Czech}, the EMMT test set (Zhu et al., 2023) for English-to-Chinese, comprising 500 titles of commercial products from e-commerce websites translated from English to Chinese, the VATEX test set (Wang et al., 2019) for English-to-Chinese, which is composed of 10-second videos with English captions translated into Chinese (we take 5 frames per second, compute SIGLIP features and average them to obtain the visual input), and finally CoMMuTE (Futeral et al., 2023) for English-to-{German,French,Czech,Chinese,Russian,Arabic}, used to test how well the MMT models exploit visual context for disambiguation.

	Fr		De		Cs		Zh
	BLEU	COMET	BLEU	COMET	BLEU	COMET	BLEU	COMET
Text-only MT baselines
NLLB-600M distilled	49.17 $\pm$ 0.78	85.18 $\pm$ 0.67	33.04 $\pm$ 3.44	81.98 $\pm$ 2.16	26.58 $\pm$ 0.19	85.02 $\pm$ 0.42	16.07 $\pm$ 1.35	57.81 $\pm$ 4.21
NLLB-1.3B	51.90 $\pm$ 0.79	86.28 $\pm$ 0.77	35.39 $\pm$ 2.83	83.49 $\pm$ 2.15	30.77 $\pm$ 0.46	87.48 $\pm$ 0.29	18.22 $\pm$ 0.20	60.02 $\pm$ 3.51
NLLB-3.3B	53.73 $\pm$ 0.57	86.98 $\pm$ 0.88	37.26 $\pm$ 2.10	84.76 $\pm$ 1.76	33.37 $\pm$ 0.27	88.70 $\pm$ 0.37	20.55 $\pm$ 0.46	61.27 $\pm$ 3.50
MMT – fully supervised
Gated Fusion bilingual	49.79 $\pm$ 7.46	80.62 $\pm$ 3.01	31.57 $\pm$ 5.24	72.89 $\pm$ 3.15	28.30 $\pm$ 2.52	79.24 $\pm$ 2.41	-	-
VTLM + MMT bilingual	55.27 $\pm$ 6.00	83.45 $\pm$ 1.98	35.94 $\pm$ 3.44	79.10 $\pm$ 2.35	32.63 $\pm$ 2.26	82.40 $\pm$ 1.77	-	-
VGAMT full bilingual	59.97 $\pm$ 6.66	88.29 $\pm$ 1.83	39.10 $\pm$ 3.14	85.72 $\pm$ 1.73	35.89 $\pm$ 1.70	89.50 $\pm$ 1.08	-	-
VGAMT SIGLIP-only multi.	58.39 $\pm$ 5.67	87.27 $\pm$ 1.74	37.36 $\pm$ 3.51	83.85 $\pm$ 2.04	34.88 $\pm$ 1.77	87.45 $\pm$ 1.19	-	-
MMT – zero-shot
Multilingual OpenFlamingo	35.08 $\pm$ 0.76	82.66 $\pm$ 1.38	24.92 $\pm$ 2.89	79.93 $\pm$ 2.44	3.27 $\pm$ 0.04	70.73 $\pm$ 0.55	8.60 $\pm$ 5.86	53.38 $\pm$ 10.24
ZeroMMT-600M (ours) multi.	49.00 $\pm$ 1.07	84.82 $\pm$ 0.79	32.79 $\pm$ 2.97	81.13 $\pm$ 2.48	25.24 $\pm$ 0.62	83.79 $\pm$ 0.55	15.74 $\pm$ 1.62	57.10 $\pm$ 4.72
ZeroMMT-1.3B (ours) multi.	52.06 $\pm$ 1.15	86.15 $\pm$ 0.84	35.18 $\pm$ 2.58	83.35 $\pm$ 1.90	30.14 $\pm$ 0.48	86.94 $\pm$ 0.33	17.11 $\pm$ 0.71	59.17 $\pm$ 4.34
ZeroMMT-3.3B (ours) multi.	53.34 $\pm$ 0.50	86.69 $\pm$ 0.94	37.08 $\pm$ 2.49	84.41 $\pm$ 1.77	33.03 $\pm$ 0.34	88.37 $\pm$ 0.32	19.43 $\pm$ 0.64	60.61 $\pm$ 4.28

Table 2: Aggregated generation results for En $\rightarrow$ X. Fr and De results are averaged over Test2016, Test2017 from Multi30K and AmbiguousCOCO. Cs results are averaged over Multi30K Test2016 and Test2018. Zh results are averaged over EMMT and VATEX test sets.

	Ar	Cs	De	Fr	Ru	Zh
Text-only MT baselines	50.0	50.0	50.0	50.0	50.0	50.0
NLLB-SIGLIP topline	82.6	76.0	83.7	75.0	75.8	88.1
	MMT – fully supervised
Gated Fusion bilingual	-	51.0 $\pm$ 1.9	49.7 $\pm$ 0.6	50.0 $\pm$ 0.8	-	-
VTLM + MMT bilingual	-	52.0 $\pm$ 0.7	50.2 $\pm$ 0.3	51.4 $\pm$ 0.9	-	-
VGAMT full bilingual	-	55.6 $\pm$ 0.8	59.0 $\pm$ 0.5	67.1 $\pm$ 0.7	-	-
VGAMT SIGLIP-only multi.	-	57.5 $\pm$ 1.2	57.1 $\pm$ 0.4	61.3 $\pm$ 1.1	-	-
	MMT – zero-shot
Multilingual OpenFlamingo	61.3 $\pm$ 1.2	59.1 $\pm$ 1.2	63.7 $\pm$ 1.2	68.5 $\pm$ 1.2	67.4 $\pm$ 1.2	66.5 $\pm$ 1.2
ZeroMMT-600M (ours) multi.	56.1 $\pm$ 0.8	55.5 $\pm$ 0.5	55.7 $\pm$ 0.3	58.7 $\pm$ 0.4	57.2 $\pm$ 1.2	58.2 $\pm$ 1.1
ZeroMMT-1.3B (ours) multi.	57.3 $\pm$ 0.2	59.4 $\pm$ 0.5	57.4 $\pm$ 0.4	62.2 $\pm$ 0.5	60.6 $\pm$ 0.5	60.1 $\pm$ 0.8
ZeroMMT-3.3B (ours) multi.	58.9 $\pm$ 0.5	61.7 $\pm$ 0.3	60.8 $\pm$ 0.8	65.0 $\pm$ 0.7	62.9 $\pm$ 0.3	60.1 $\pm$ 0.7

Table 3: Results on CoMMuTE, averaged over 3 runs ($\pm$ standard error).

5.2 Implementation details

Modelling

We trained three ZeroMMT models of different sizes (600M, 1.3B and 3.3B), using NLLB (Costa-jussà et al., 2022) models as implemented in the Transformers library (Wolf et al., 2020) as the underlying MT system. For SigLIP (Zhai et al., 2023), we use ViT-B-16-SigLIP-384 trained on WebLI (Chen et al., 2023) from the timm library (Wightman, 2019). Following VGAMT (Futeral et al., 2023), we used bottleneck adapters (Houlsby et al., 2019) as implemented in the Adapters Python library (Poth et al., 2023) with a reduction factor of 8 and ReLU activation (Agarap, 2018) for each layer. The visual projector is a 1-layer neural network followed by a ReLU activation, projecting SigLIP (Zhai et al., 2023) embeddings to the hidden dimension of NLLB. The image representation is then concatenated to the sequence of text embeddings. The cross-attention mechanism in the decoder of the sequence-to-sequence model can only attend to the positions of text embeddings. Similarly to VGAMT, we randomly mask 25% of the input tokens for VMLM.
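The random masking step for VMLM can be pictured with a small sketch. The helper name, the `<mask>` symbol and the choice to mask a fixed rounded count of positions (rather than, say, sampling each position independently) are assumptions made for illustration only; the text states only that 25% of the input tokens are masked at random.

```python
import random

def mask_tokens(tokens, mask_ratio=0.25, mask_token="<mask>", seed=None):
    """Replace a random subset of tokens with a mask symbol.

    Returns the masked sequence and the sorted list of masked indices,
    i.e. the set M used in the VMLM objective.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio)) if tokens else 0
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in masked_idx else tok for i, tok in enumerate(tokens)]
    return masked, masked_idx

# Hypothetical caption, tokenised on whitespace for simplicity.
tokens = "a man is playing the bass on stage".split()
masked, idx = mask_tokens(tokens, seed=0)
print(masked, idx)
```

With eight tokens and a ratio of 0.25, two positions are masked; the model must then recover them from the remaining text and the image.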

Training

We train our models with a batch size of 32, the Adam optimizer (Kingma and Ba, 2015) with $\beta_{1}=0.9$ and $\beta_{2}=0.99$ and a learning rate of $10^{-4}$. We use $\lambda=0.1$ to balance the two training losses. All hyperparameters were selected based on the combination of the CoMMuTE validation set (see Section 3) and the English–French validation dataset of Multi30K, each score weighted equally. All our models are multilingual if not otherwise specified. We run each experiment three times with three different seeds and report average scores and standard error. Training took 15 hours on one NVIDIA V100 for the 600M model and 20 hours on one NVIDIA A100 for the larger models.
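For readers unfamiliar with the reporting convention, the sketch below shows how a mean and standard error over three seeds could be computed. It assumes the standard error is the sample standard deviation divided by the square root of the number of runs; the scores are invented for the example and are not taken from the tables.

```python
import math
import statistics

def mean_and_standard_error(scores):
    """Mean and standard error of the mean, using the sample standard deviation."""
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, std_err

# Hypothetical CoMMuTE accuracies from three seeds.
runs = [56.0, 57.0, 58.0]
mean, se = mean_and_standard_error(runs)
print(f"{mean:.1f} +/- {se:.2f}")
```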

Evaluation

We evaluate MMT generation with BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020). For BLEU, we use the Sacrebleu implementation (Post, 2018) with 13a tokenization for French, German and Czech and zh tokenization for Chinese. For COMET, we use the wmt22-comet-da model (Rei et al., 2022), which is built on the XLM-R backbone (Conneau et al., 2020). The translations were obtained with beam search decoding (Graves, 2012) with a beam size of 4. Following Futeral et al. (2023), we calculate disambiguation accuracy using CoMMuTE. More precisely, given an English sentence and an associated image, we compute the perplexity of each contrastive translation; the example score is 1 if the perplexity of the correct translation is lower than the perplexity of the contrastive one and 0 otherwise.
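The contrastive scoring rule can be sketched as follows. The function names and the toy log-probabilities are hypothetical, and the sketch assumes perplexity is computed from natural-log token probabilities averaged over the tokens of each translation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def commute_accuracy(examples):
    """Percentage of examples where the correct translation has lower perplexity.

    examples: iterable of (correct_logprobs, contrastive_logprobs) pairs, both
    scored by the model given the same source sentence and image.
    """
    scores = [
        1 if perplexity(correct) < perplexity(contrastive) else 0
        for correct, contrastive in examples
    ]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical log-probabilities for two (sentence, image) instances.
examples = [
    ([-0.2, -0.5, -0.1], [-0.2, -1.9, -0.1]),  # correct translation preferred
    ([-0.4, -1.2], [-0.4, -0.9]),              # contrastive translation preferred
]
print(commute_accuracy(examples))
```

A text-only model assigns the same perplexities whichever image is shown, so it can be right for at most one of the two images attached to a sentence, which is why its accuracy sits at 50%.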

5.3 Results

Baselines and comparative models

We compare our approach to several others. Firstly, we compare to the text-only MT systems on which the ZeroMMT models are based: NLLB-600M distilled, NLLB-1.3B and NLLB-3.3B (Costa-jussà et al., 2022). We also compare against well-known fully supervised MMT systems: Gated Fusion (Wu et al., 2021), a tiny 3M-parameter Transformer trained from scratch on M30K; VTLM (+ MMT) (Caglayan et al., 2021), a 44M-parameter MMT system first pretrained on the translation language modelling (TLM) objective (Lample and Conneau, 2019) with an additional image as input on translated captioning data (we use the same translated data and tokenizer as in our approach) and then MMT-finetuned on M30K; and VGAMT (Futeral et al., 2023), a 630M-parameter MMT system (of which 13M are trainable), which is an MT-fine-tuned mBART (Liu et al., 2020) transformed into an MMT system through the addition of lightweight adapters trained jointly on the MMT and VMLM objectives. VGAMT originally uses multiple types of visual input and is bilingual. Therefore, to have a comparable setup, we retrain a VGAMT-like model with NLLB-600M distilled as the underlying MT model, with SIGLIP features only and in a multilingual setting. Finally, we compare to Multilingual OpenFlamingo (Futeral et al., 2024), a 3B multilingual multimodal language model pretrained on a large number of text-image pairs and interleaved documents, which allows for zero-shot MMT in a way that is comparable with what our model does.

We also compute an approximate upper bound on CoMMuTE for models trained with SIGLIP features by evaluating NLLB-SIGLIP (Visheratin, 2023). Concretely, for each instance of CoMMuTE, we compute the cosine similarity between the image and the translation correctly associated with it, and the cosine similarity between the image and the translation wrongly associated with it. If the similarity of the correct translation is higher than that of the wrong one, the prediction is considered correct.
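A minimal sketch of this retrieval-style decision rule is given below. The embedding vectors are made up and the helper names are hypothetical; in practice the image and text embeddings would come from the NLLB-SIGLIP encoders, which are not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prediction_is_correct(image_emb, correct_text_emb, wrong_text_emb):
    """True if the image is closer to the correct translation than to the wrong one."""
    return cosine_similarity(image_emb, correct_text_emb) > cosine_similarity(
        image_emb, wrong_text_emb
    )

# Toy 3-dimensional embeddings (illustrative values only).
image = [1.0, 0.0, 1.0]
correct_translation = [0.9, 0.1, 0.8]
wrong_translation = [0.1, 1.0, 0.0]
print(prediction_is_correct(image, correct_translation, wrong_translation))
```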

Quantitative results

Tables 2 and 3 show the aggregated results on generation benchmarks and CoMMuTE respectively. Full results can be found in Appendix A. Compared to the text-only NLLB-600M distilled model, we observe that our approach results in only a small drop in performance on generation benchmarks (-0.52 BLEU points and -0.79 COMET points on average), where images are not required to translate the sentence correctly, despite not using the Multi30K training data or any fully supervised data. For the disambiguation task, Multilingual OpenFlamingo obtains the strongest CoMMuTE scores for most languages but it fails in generation as it was not specifically trained to translate. Our approach is significantly better than the random baseline ($>$ 55% for all languages for the smallest model, $>$ 61% on average for the largest model), showing that it is able to exploit images for disambiguation; results are close to VGAMT scores for Czech despite not having been trained on fully supervised data. Table 4 shows additional results on CoMMuTE, this time used as a traditional MMT generation benchmark. We obtain higher COMET scores than the text-only baseline NLLB on all languages for all model sizes except Czech for the smallest model. These results show that our approach is able to maintain good translation performance whilst still being able to exploit visual information for disambiguation. We shall see in Section 7 how the trade-off between translation quality and image exploitation can be controlled and how our approach can be made to match or outperform Multilingual OpenFlamingo both in translation quality (as measured by BLEU and COMET) and in image exploitation (as measured on CoMMuTE).

	Ar	Cs	De	Fr	Ru	Zh
NLLB-600M	79.06	83.31	80.84	80.17	79.94	75.41
ZeroMMT-600M (ours)	79.43	82.62	81.11	81.36	80.67	75.90
NLLB-1.3B	80.59	84.12	81.28	80.14	81.72	75.30
ZeroMMT-1.3B	81.02	85.41	83.46	82.35	83.45	77.06
NLLB-3.3B	79.90	84.97	81.20	81.16	82.26	76.69
ZeroMMT-3.3B	80.64	86.69	83.54	83.40	83.87	77.85

Table 4: COMET scores on CoMMuTE (as a classic generation task). The best result for each model size is in bold.

Qualitative results

We analysed some translations of our ZeroMMT-600M model and compared them with those of the text-only distilled NLLB-600M model. We notice that for all languages, our model is able to exploit the image to slightly change the translation towards the correct meaning, as illustrated by Figures 2(c), 2(d), 2(a), 3(a), 3(b) and 2(b), where ambiguous parts of the translations change given the image fed to the model. In example (2) of Figure 2(b), the model exploits the image to change the translation from trenéry ‘trainer’ to autobusy ‘bus’, which, although not correct, is closer to the reference translation. Again, in Figure 2(c), the translation is improved, with the word bass being translated as 鱼 ‘fish’ rather than as 低音 ‘bass (low tone)’. We also notice very few variations in the other areas of the translation in comparison with the NLLB translation, which means that our model correctly identifies the part to change.

6 Ablation study

	Translation sets		CoMMuTE
	BLEU	COMET	accuracy
ZeroMMT-600M	32.73 $\pm$ 12.33	77.95 $\pm$ 10.82	56.9 $\pm$ 1.4
w/o VMLM	33.12 $\pm$ 12.01	78.49 $\pm$ 10.69	50.3 $\pm$ 0.4
w/o KL	14.10 $\pm$ 10.70	65.88 $\pm$ 11.72	58.9 $\pm$ 1.8
+ MMT w/o KL	32.09 $\pm$ 12.62	77.50 $\pm$ 10.80	55.5 $\pm$ 1.3

Table 5: Ablation study. Aggregated scores over benchmarks and languages. The best results are in bold and second best are underlined.

We conduct an ablation study on our ZeroMMT-600M model to analyse the impact of our two objectives on the results observed in Section 5.3. We first train a model without the VMLM objective, then a model without the KL penalty. We also test the replacement of the KL penalty with a standard auto-regressive MMT translation loss with the translated data as the ground truth, and finally we vary the KL penalty coefficient to observe the evolution of COMET and CoMMuTE scores.

KL penalty only (i.e. without VMLM)

In Table 5, we show that with the KL penalty only, the model is not capable of exploiting visual information for translation. This is because there is no need for the model to use the input image and the model learns to ignore it. The aggregated CoMMuTE score of 50.3 is close to random guessing.

VMLM only (i.e. without KL)

Table 5 also shows that, while the VMLM objective allows the model to obtain good scores on CoMMuTE (it is able to exploit visual information), the scores on generation benchmarks collapse as expected, with -19 BLEU points and -12 COMET points in comparison to the full approach.

KL penalty vs. MMT objective

Finally, we replace the KL penalty with the standard MMT objective (i.e. + MMT w/o KL in Table 5) as the objective to maintain translation quality. We observe that using the MMT translation objective leads to a drop of 0.64 BLEU points and 0.45 COMET points on average in comparison with the use of the KL penalty. It additionally results in an average drop in performance of 1.4 points on CoMMuTE.

Varying the trade-off between objectives

In Figure 4 we show the variation of COMET and CoMMuTE when testing our approach with different $\lambda$ coefficients for the KL penalty. We notice that when $\lambda$ is too high, it results in a large average drop of performance on CoMMuTE.

7 Controlling disambiguation level at inference time

ZeroMMT-600M
$\gamma$	BLEU	COMET	CoMMuTE
1.0	32.73 $\pm$ 12.33	77.95 $\pm$ 10.82	56.9 $\pm$ 1.4
1.25	32.39 $\pm$ 12.24	77.73 $\pm$ 10.76	58.4 $\pm$ 1.4
1.5	31.81 $\pm$ 12.04	77.39 $\pm$ 10.66	59.7 $\pm$ 1.8
2.0	30.29 $\pm$ 11.52	76.35 $\pm$ 10.46	62.3 $\pm$ 1.9
2.5	27.98 $\pm$ 10.75	74.68 $\pm$ 10.09	64.1 $\pm$ 2.1
3.0	25.03 $\pm$ 9.56	72.29 $\pm$ 9.58	65.4 $\pm$ 2.2
ZeroMMT-1.3B
1.0	35.62 $\pm$ 12.58	80.07 $\pm$ 10.78	59.5 $\pm$ 1.8
1.25	35.25 $\pm$ 12.56	79.89 $\pm$ 10.73	61.6 $\pm$ 1.3
1.5	34.67 $\pm$ 12.42	79.62 $\pm$ 10.65	63.8 $\pm$ 2.8
2.0	32.96 $\pm$ 11.92	78.72 $\pm$ 10.40	66.1 $\pm$ 2.4
2.5	30.36 $\pm$ 11.29	77.09 $\pm$ 10.04	68.0 $\pm$ 2.4
3.0	27.06 $\pm$ 10.21	74.60 $\pm$ 9.49	69.2 $\pm$ 2.0
ZeroMMT-3.3B
1.0	37.62 $\pm$ 12.11	81.12 $\pm$ 10.59	61.6 $\pm$ 2.1
1.25	37.30 $\pm$ 12.07	80.98 $\pm$ 10.56	64.2 $\pm$ 2.5
1.5	36.84 $\pm$ 11.89	80.76 $\pm$ 10.53	65.8 $\pm$ 2.7
2.0	35.02 $\pm$ 11.07	79.92 $\pm$ 10.42	68.5 $\pm$ 2.9
2.5	32.15 $\pm$ 10.14	78.42 $\pm$ 10.16	70.3 $\pm$ 2.6
3.0	28.75 $\pm$ 9.08	76.25 $\pm$ 9.69	71.7 $\pm$ 2.5
MOF	20.37 $\pm$ 12.92	73.60 $\pm$ 11.98	64.4 $\pm$ 3.4

Table 6: Evolution of BLEU, COMET and CoMMuTE aggregated scores reached by our CFG-enabled ZeroMMT model over benchmarks and languages compared to the vanilla, CFG-free ZeroMMT model (i.e. with $\gamma=1.0$). Bold is the best result for each model size. Underline is the second best. The “MOF” line gives the corresponding scores for Multilingual OpenFlamingo.

We show that our method allows us to obtain an MMT system with a good trade-off between keeping strong translation quality and having the capacity to exploit visual context for disambiguation. However, some applications could require stronger disambiguation capabilities and be less reliant on translation fidelity or vice versa. Instead of retraining a model to control the trade-off between the two objectives, we instead propose to use classifier-free guidance (CFG) (Ho and Salimans, 2021; Sanchez et al., 2023) to control this trade-off without retraining or using fully supervised data. We define CFG in the context of MMT as follows:

$\displaystyle\widehat{f}_{\theta,\beta}(y_{j};y_{<j},x,i)=f_{\theta}(y_{j};y_{<j},x)+\gamma\big(f_{\theta,\beta}(y_{j};y_{<j},x,i)-f_{\theta}(y_{j};y_{<j},x)\big)$ (3)

where $f_{\theta}$ is the text-only MT system, $f_{\theta,\beta}$ is the adapted MMT system, $x$ is the source sentence, $y$ is the generated sentence, $i$ is the visual input, $j$ is the token index and $\gamma$ is the CFG value that controls the guidance.
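The guidance rule can be illustrated on a single decoding step with a toy vocabulary. This is only a sketch: it assumes the combination is applied to per-token log-probabilities and then renormalised, which the text does not specify, and the function names, vocabulary and scores are hypothetical.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def cfg_logprobs(text_only_logprobs, mmt_logprobs, gamma=1.5):
    """Apply f_text + gamma * (f_mmt - f_text) per vocabulary entry, then renormalise."""
    guided = [t + gamma * (m - t) for t, m in zip(text_only_logprobs, mmt_logprobs)]
    return log_softmax(guided)

# Toy vocabulary for translating "bass": ["fish", "low tone", "other"].
text_only = log_softmax([1.0, 2.0, 0.0])  # text-only model prefers "low tone"
mmt = log_softmax([1.8, 1.6, 0.0])        # with the image, "fish" is slightly preferred

for gamma in (1.0, 2.0):
    probs = [math.exp(lp) for lp in cfg_logprobs(text_only, mmt, gamma)]
    print(gamma, [round(p, 3) for p in probs])
```

With $\gamma=1$ the guided distribution equals the MMT distribution, and with $\gamma=0$ it falls back to the text-only one; values above 1 amplify whatever the image changed, which matches the intuition that stronger guidance favours disambiguation at some cost to fidelity.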

We run experiments to observe the evolution of BLEU and COMET scores on generation benchmarks and CoMMuTE scores when varying the $\gamma$ parameter. We show in Table 6 that for ZeroMMT-600M we can achieve a boost in CoMMuTE accuracy of 2.8 points for $\gamma=1.5$, while facing only a moderate drop in BLEU and COMET scores. Higher $\gamma$ values result in stronger disambiguation capabilities, as shown by CoMMuTE, but this comes at the expense of a drop in generation quality. CFG can therefore allow us to control the trade-off between disambiguation capabilities and translation fidelity depending on the application. Importantly, when $\gamma\geq 2.5$, ZeroMMT-600M matches or outperforms Multilingual OpenFlamingo on both translation quality (as measured by BLEU and COMET) and image exploitation (as measured on CoMMuTE) while being five times smaller. With larger models, we strongly outperform Multilingual OpenFlamingo on all metrics for different CFG values and we can obtain CoMMuTE scores up to 71.7 on average for $\gamma=3.0$ for ZeroMMT-3.3B.

8 Conclusion

We present ZeroMMT, a novel zero-shot MMT approach to bypass the need for fully supervised data when training MMT models. We observe good disambiguation capabilities such that ZeroMMT is able to effectively exploit images in order to translate English sentences correctly. At the same time, the approach manages to maintain a good translation performance, with only a very small drop in performance according to standard generation MMT benchmarks where images are not necessary to translate the sentence correctly. ZeroMMT allows us to extend MMT to new languages as it performs well on CoMMuTE for Russian and Arabic for which no fully supervised MMT training data is available. Moreover, we show that it is possible to control the disambiguation-generation trade-off using classifier-free guidance. It is therefore a step towards having MMT systems that cover a broader set of languages without having to rely on acquiring costly training data.

Limitations

While our approach allows us to exploit images for translation disambiguation, as shown by the scores obtained on CoMMuTE, it is still behind the upper bound. Zero-shot disambiguation capabilities also come at the expense of a slight drop in translation quality, as shown by BLEU and COMET scores. There are therefore areas for improvement, even if our zero-shot approach is close to its fully supervised counterparts. It is nevertheless a step towards zero-shot multimodal machine translation.

Ethics Statement

The released extension of CoMMuTE is designed to evaluate the disambiguation capabilities of MMT systems and should not be used in any other way. Images were collected under the Creative Commons license and CoMMuTE is distributed under the CC-BY-SA-4.0 license. All of our models are also distributed under the CC-BY-SA-4.0 license.

Acknowledgements

This work was granted access to the HPC resources of IDRIS under the allocations 2023-AD011013908R1 and 2023-AD011012254 made by GENCI. It was also partly funded by the last three authors’ chairs in the PRAIRIE institute funded by the French national agency ANR as part of the “Investissements d’avenir” programme under the reference ANR-19-P3IA-0001.

References

Agarap (2018) Abien Fred Agarap. 2018. Deep Learning using Rectified Linear Units (ReLU). arXiv preprint arXiv:1803.08375.
Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Barrault et al. (2018) Loïc Barrault, Fethi Bougares, Lucia Specia, Chiraag Lala, Desmond Elliott, and Stella Frank. 2018. Findings of the third shared task on multimodal machine translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 304–323, Brussels, Belgium. Association for Computational Linguistics.
Bowen et al. (2024) Braeden Bowen, Vipin Vijayan, Scott Grigsby, Timothy Anderson, and Jeremy Gwinnup. 2024. Detecting concrete visual tokens for multimodal machine translation. arXiv preprint arXiv:2403.03075.
Caglayan et al. (2021) Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, and Lucia Specia. 2021. Cross-lingual visual pre-training for multimodal machine translation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1317–1324, Online. Association for Computational Linguistics.
Caglayan et al. (2019) Ozan Caglayan, Pranava Madhyastha, Lucia Specia, and Loïc Barrault. 2019. Probing the need for visual context in multimodal machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4159–4170, Minneapolis, Minnesota. Association for Computational Linguistics.
Calixto et al. (2016) Iacer Calixto, Desmond Elliott, and Stella Frank. 2016. DCU-UvA multimodal MT system report. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 634–638, Berlin, Germany. Association for Computational Linguistics.
Calixto and Liu (2017) Iacer Calixto and Qun Liu. 2017. Incorporating global visual features into attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 992–1003, Copenhagen, Denmark. Association for Computational Linguistics.
Chen et al. (2023) Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme Ruiz, Andreas Peter Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. 2023. PaLI: A jointly-scaled multilingual language-image model. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda.
Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
Costa-jussà et al. (2022) Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
Elliott (2018) Desmond Elliott. 2018. Adversarial evaluation of multimodal machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2974–2978, Brussels, Belgium. Association for Computational Linguistics.
Elliott et al. (2017) Desmond Elliott, Stella Frank, Loïc Barrault, Fethi Bougares, and Lucia Specia. 2017. Findings of the second shared task on multimodal machine translation and multilingual image description. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 215–233, Copenhagen, Denmark. Association for Computational Linguistics.
Elliott et al. (2016) Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia. 2016. Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74, Berlin, Germany. Association for Computational Linguistics.
Elliott and Kádár (2017) Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 130–141, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Fei et al. (2023) Hao Fei, Qian Liu, Meishan Zhang, Min Zhang, and Tat-Seng Chua. 2023. Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5980–5994, Toronto, Canada. Association for Computational Linguistics.
Futeral et al. (2023) Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot, and Rachel Bawden. 2023. Tackling ambiguity with images: Improved multimodal machine translation and contrastive evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5394–5413, Toronto, Canada. Association for Computational Linguistics.
Futeral et al. (2024) Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, and Benoît Sagot. 2024. mOSCAR: A large-scale multilingual and multimodal document-level corpus. arXiv preprint arXiv:2406.08707.
Graves (2012) Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.
Gupta et al. (2023) Devaansh Gupta, Siddhant Kharbanda, Jiawei Zhou, Wanhua Li, Hanspeter Pfister, and Donglai Wei. 2023. CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Hirasawa et al. (2023) Tosho Hirasawa, Emanuele Bugliarello, Desmond Elliott, and Mamoru Komachi. 2023. Visual prediction improves zero-shot cross-modal machine translation. In Proceedings of the Eighth Conference on Machine Translation, pages 522–535, Singapore. Association for Computational Linguistics.
Ho and Salimans (2021) Jonathan Ho and Tim Salimans. 2021. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
Huang et al. (2020) Po-Yao Huang, Junjie Hu, Xiaojun Chang, and Alexander Hauptmann. 2020. Unsupervised multimodal neural machine translation with pseudo visual pivoting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8226–8237, Online. Association for Computational Linguistics.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).
Lala and Specia (2018) Chiraag Lala and Lucia Specia. 2018. Multimodal lexical translation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Li et al. (2022a) Bei Li, Chuanhao Lv, Zefan Zhou, Tao Zhou, Tong Xiao, Anxiang Ma, and JingBo Zhu. 2022a. On vision features in multimodal machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6327–6337, Dublin, Ireland. Association for Computational Linguistics.
Li et al. (2021) Jiaoda Li, Duygu Ataman, and Rico Sennrich. 2021. Vision matters when it should: Sanity checking multimodal machine translation models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8556–8562, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Li et al. (2019) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. CoRR, abs/1908.03557.
Li et al. (2022b) Yi Li, Rameswar Panda, Yoon Kim, Chun-Fu (Richard) Chen, Rogerio S. Feris, David Cox, and Nuno Vasconcelos. 2022b. Valhalla: Visual hallucination for machine translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5216–5226.
Libovický et al. (2016) Jindřich Libovický, Jindřich Helcl, Marek Tlustý, Ondřej Bojar, and Pavel Pecina. 2016. CUNI system for WMT16 automatic post-editing and multimodal translation tasks. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 646–654, Berlin, Germany. Association for Computational Linguistics.
Lin et al. (2020) Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, and Jiebo Luo. 2020. Dynamic context-guided capsule network for multimodal machine translation. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1320–1329.
Liu et al. (2021) Pengbo Liu, Hailong Cao, and Tiejun Zhao. 2021. Gumbel-attention for multi-modal machine translation. arXiv preprint arXiv:2103.08862.
Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.
Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Mathur et al. (2020) Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2020. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online. Association for Computational Linguistics.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
Poth et al. (2023) Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, and Jonas Pfeiffer. 2023. Adapters: A unified library for parameter-efficient and modular transfer learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 149–160, Singapore. Association for Computational Linguistics.
Rei et al. (2022) Ricardo Rei, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
Sanchez et al. (2023) Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, and Stella Biderman. 2023. Stay on topic with classifier-free guidance. arXiv preprint arXiv:2306.17806.
Sato et al. (2023) Julia Sato, Helena Caseli, and Lucia Specia. 2023. Choosing what to mask: More informed masking for multimodal machine translation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 244–253, Toronto, Canada. Association for Computational Linguistics.
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
Su et al. (2019) Yuanhang Su, Kai Fan, Nguyen Bach, C-C Jay Kuo, and Fei Huang. 2019. Unsupervised multi-modal neural machine translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10482–10491.
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27.
Vijayan et al. (2024) Vipin Vijayan, Braeden Bowen, Scott Grigsby, Timothy Anderson, and Jeremy Gwinnup. 2024. Adding multimodal capabilities to a text-only translation model. arXiv preprint arXiv:2403.03045.
Visheratin (2023) Alexander Visheratin. 2023. NLLB-CLIP–train performant multilingual image retrieval model on a budget. arXiv preprint arXiv:2309.01859.
Wang et al. (2019) Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. 2019. VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591.
Wightman (2019) Ross Wightman. 2019. PyTorch image models. https://github.com/rwightman/pytorch-image-models.
Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Wu et al. (2021) Zhiyong Wu, Lingpeng Kong, Wei Bi, Xiang Li, and Ben Kao. 2021. Good for misconceived reasons: An empirical revisiting on the need for visual context in multimodal machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6153–6166, Online. Association for Computational Linguistics.
Yao and Wan (2020) Shaowei Yao and Xiaojun Wan. 2020. Multimodal transformer for multimodal machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4346–4350, Online. Association for Computational Linguistics.
Yin et al. (2020) Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, and Jiebo Luo. 2020. A novel graph-based multi-modal fusion encoder for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3025–3035, Online. Association for Computational Linguistics.
Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986.
Zhu et al. (2023) Yaoming Zhu, Zewei Sun, Shanbo Cheng, Luyang Huang, Liwei Wu, and Mingxuan Wang. 2023. Beyond triplet: Leveraging the most data for multimodal machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2679–2697, Toronto, Canada. Association for Computational Linguistics.

Appendix A Detailed results

A.1 Main results

Tables 7, 8, 9 and 10 show full BLEU and COMET scores for all languages and all benchmarks.

Text-only MT baselines
	Test2016		Test2017		COCO
	BLEU	COMET	BLEU	COMET	BLEU	COMET
NLLB-600M distilled	48.71	85.18	48.54	85.99	50.28	84.36
NLLB-1.3B	51.68	86.60	51.06	87.01	52.95	85.22
NLLB-3.3B	54.15	87.57	52.92	87.64	54.11	85.73
MMT – fully supervised
Gated Fusion bilingual	58.70 $\pm$ 0.30	83.60 $\pm$ 0.08	50.80 $\pm$ 0.70	81.74 $\pm$ 0.25	40.40 $\pm$ 0.40	76.52 $\pm$ 0.25
VTLM + MMT bilingual	63.37 $\pm$ 0.13	85.29 $\pm$ 0.06	55.77 $\pm$ 0.17	84.35 $\pm$ 0.04	47.69 $\pm$ 0.16	80.70 $\pm$ 0.20
VGAMT full bilingual	67.20 $\pm$ 0.10	89.78 $\pm$ 0.04	61.60 $\pm$ 0.10	89.37 $\pm$ 0.04	51.10 $\pm$ 0.60	85.78 $\pm$ 0.11
VGAMT SIGLIP-only multi.	65.04 $\pm$ 0.52	88.74 $\pm$ 0.04	58.90 $\pm$ 0.28	88.23 $\pm$ 0.19	51.24 $\pm$ 0.73	84.84 $\pm$ 0.29
MMT – zero-shot
Multilingual OpenFlamingo	36.01 $\pm$ 0.38	83.56 $\pm$ 0.38	35.10 $\pm$ 0.38	83.72 $\pm$ 0.38	34.14 $\pm$ 0.38	80.71 $\pm$ 0.38
ZeroMMT-600M (ours) multi.	48.62 $\pm$ 0.38	84.92 $\pm$ 0.09	48.10 $\pm$ 0.11	85.66 $\pm$ 0.16	50.29 $\pm$ 0.82	83.78 $\pm$ 0.20
ZeroMMT-1.3B (ours) multi.	51.47 $\pm$ 0.11	86.42 $\pm$ 0.17	51.10 $\pm$ 0.02	87.00 $\pm$ 0.17	53.60 $\pm$ 0.54	85.03 $\pm$ 0.08
ZeroMMT-3.3B (ours) multi.	52.89 $\pm$ 0.36	87.22 $\pm$ 0.05	53.29 $\pm$ 0.19	87.48 $\pm$ 0.13	53.86 $\pm$ 0.30	85.38 $\pm$ 0.13

Table 7: En $\rightarrow$ Fr results for Test2016, Test2017 and COCO subsets of Multi30k, avg. over 3 runs ($\pm$ standard error).
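As a reading aid for the "$\pm$ standard error" entries in these tables, the following minimal sketch computes a mean and its standard error over three runs. It assumes the usual definition (sample standard deviation divided by $\sqrt{n}$); the exact estimator used for the tables is not stated here, and the helper name and the scores are hypothetical.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return (mean, standard error of the mean) for per-run scores.

    Assumption: the standard error is the sample standard deviation
    (n - 1 in the denominator) divided by sqrt(n).
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem


# Hypothetical BLEU scores from three runs with different seeds.
runs = [48.0, 49.0, 50.0]
mean, sem = mean_and_standard_error(runs)
print(f"{mean:.2f} $\\pm$ {sem:.2f}")
```

With only three runs the standard error is itself a noisy estimate, so small differences between rows should be read with that in mind.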

Text-only MT baselines
	Test2016		Test2017		COCO
	BLEU	COMET	BLEU	COMET	BLEU	COMET
NLLB-600M distilled	37.14	83.79	33.24	83.21	28.73	78.95
NLLB-1.3B	37.91	85.14	36.81	84.86	31.44	80.45
NLLB-3.3B	39.47	86.22	37.86	85.76	34.44	82.28
MMT – fully supervised
Gated Fusion bilingual	38.70 $\pm$ 0.20	76.32 $\pm$ 0.17	29.50 $\pm$ 0.20	73.61 $\pm$ 0.32	26.60 $\pm$ 0.30	68.74 $\pm$ 0.36
VTLM + MMT bilingual	40.46 $\pm$ 0.64	81.58 $\pm$ 0.08	35.19 $\pm$ 0.16	79.79 $\pm$ 0.06	32.18 $\pm$ 0.21	75.94 $\pm$ 0.08
VGAMT full bilingual	43.30 $\pm$ 0.20	87.34 $\pm$ 0.08	38.30 $\pm$ 0.20	86.49 $\pm$ 0.07	35.70 $\pm$ 0.30	83.33 $\pm$ 0.08
VGAMT SIGLIP-only multi.	41.93 $\pm$ 0.75	85.79 $\pm$ 0.13	36.68 $\pm$ 0.23	84.72 $\pm$ 0.27	33.48 $\pm$ 0.13	81.05 $\pm$ 0.29
MMT – zero-shot
Multilingual OpenFlamingo	28.86 $\pm$ 0.38	82.31 $\pm$ 0.38	23.91 $\pm$ 0.38	80.91 $\pm$ 0.38	21.99 $\pm$ 0.38	76.58 $\pm$ 0.38
ZeroMMT-600M (ours) multi.	36.22 $\pm$ 0.40	83.04 $\pm$ 0.39	33.11 $\pm$ 0.68	82.54 $\pm$ 0.17	29.04 $\pm$ 0.13	77.72 $\pm$ 0.16
ZeroMMT-1.3B (ours) multi.	37.63 $\pm$ 0.13	84.80 $\pm$ 0.19	36.24 $\pm$ 0.54	84.56 $\pm$ 0.19	31.66 $\pm$ 0.47	80.68 $\pm$ 0.14
ZeroMMT-3.3B (ours) multi.	39.58 $\pm$ 0.30	85.85 $\pm$ 0.05	37.97 $\pm$ 0.21	85.46 $\pm$ 0.13	33.71 $\pm$ 0.40	81.92 $\pm$ 0.16

Table 8: En $\rightarrow$ De results for Test2016, Test2017 and COCO subsets of Multi30k, avg. over 3 runs ($\pm$ standard error).

Text-only MT baselines
	Test2016		Test2018
	BLEU	COMET	BLEU	COMET
NLLB-600M distilled	26.39	85.44	26.76	84.60
NLLB-1.3B	30.31	87.77	31.23	87.19
NLLB-3.3B	33.64	89.08	33.10	88.33
MMT – fully supervised
Gated Fusion bilingual	30.80 $\pm$ 0.40	81.64 $\pm$ 0.32	25.80 $\pm$ 0.10	76.85 $\pm$ 0.18
VTLM + MMT bilingual	34.87 $\pm$ 0.19	84.15 $\pm$ 0.17	30.38 $\pm$ 0.35	80.64 $\pm$ 0.20
VGAMT full bilingual	37.60 $\pm$ 0.20	90.57 $\pm$ 0.08	34.20 $\pm$ 0.10	88.43 $\pm$ 0.06
VGAMT SIGLIP-only multi.	36.62 $\pm$ 0.42	88.63 $\pm$ 0.16	33.13 $\pm$ 0.23	86.28 $\pm$ 0.11
MMT – zero-shot
Multilingual OpenFlamingo	3.22 $\pm$ 0.38	71.27 $\pm$ 0.38	3.31 $\pm$ 0.38	70.18 $\pm$ 0.38
ZeroMMT-600M (ours) multi.	25.66 $\pm$ 0.43	84.27 $\pm$ 0.36	24.82 $\pm$ 0.49	83.32 $\pm$ 0.14
ZeroMMT-1.3B (ours) multi.	29.98 $\pm$ 0.59	87.13 $\pm$ 0.27	30.29 $\pm$ 0.25	86.75 $\pm$ 0.28
ZeroMMT-3.3B (ours) multi.	32.99 $\pm$ 0.38	88.67 $\pm$ 0.07	33.08 $\pm$ 0.30	88.06 $\pm$ 0.11

Table 9: En $\rightarrow$ Cs results for Test2016 and Test2018 subsets of Multi30k, avg. over 3 runs ($\pm$ standard error).

Text-only MT baselines
	EMMT		VATEX
	BLEU	COMET	BLEU	COMET
NLLB-600M distilled	14.72	53.60	17.42	62.03
NLLB-1.3B	18.42	56.51	18.02	63.54
NLLB-3.3B	21.01	57.77	20.09	64.77
MMT – zero-shot
Multilingual OpenFlamingo	2.74 $\pm$ 0.38	43.14 $\pm$ 0.38	14.46 $\pm$ 0.38	63.62 $\pm$ 0.38
ZeroMMT-600M (ours) multi.	14.12 $\pm$ 0.09	52.39 $\pm$ 0.07	17.36 $\pm$ 0.13	61.82 $\pm$ 0.12
ZeroMMT-1.3B (ours) multi.	16.42 $\pm$ 0.15	54.84 $\pm$ 0.44	17.80 $\pm$ 0.16	63.50 $\pm$ 0.16
ZeroMMT-3.3B (ours) multi.	18.97 $\pm$ 0.59	56.34 $\pm$ 0.50	19.88 $\pm$ 0.25	64.87 $\pm$ 0.07

Table 10: En $\rightarrow$ Zh results for EMMT and VATEX test sets, averaged over 3 runs ($\pm$ standard error).

A.2 Ablation study

Tables 11, 12, 13 and 14 show the full results of the ablation study for all languages and all benchmarks.

	Test2016		Test2017		COCO		CoMMuTE
	BLEU	COMET	BLEU	COMET	BLEU	COMET	Accuracy
ZeroMMT-600M	48.62 $\pm$ 0.38	84.92 $\pm$ 0.09	48.10 $\pm$ 0.11	85.66 $\pm$ 0.16	50.29 $\pm$ 0.82	83.78 $\pm$ 0.20	58.7 $\pm$ 0.4
w/o VMLM	49.01 $\pm$ 0.16	85.09 $\pm$ 0.04	47.64 $\pm$ 0.19	85.59 $\pm$ 0.04	49.92 $\pm$ 0.28	83.91 $\pm$ 0.02	50.0 $\pm$ 0.3
w/o KL	28.73 $\pm$ 5.97	78.51 $\pm$ 2.96	23.63 $\pm$ 5.99	76.95 $\pm$ 3.40	30.78 $\pm$ 6.48	76.50 $\pm$ 2.98	60.4 $\pm$ 1.4
+ MMT w/o KL	48.68 $\pm$ 0.22	84.64 $\pm$ 0.21	47.64 $\pm$ 0.35	85.37 $\pm$ 0.05	49.40 $\pm$ 0.19	83.11 $\pm$ 0.18	56.8 $\pm$ 1.7

Table 11: Ablation study En $\rightarrow$ Fr. The best result is in bold and the second best result is underlined.

	Test2016		Test2017		COCO		CoMMuTE
	BLEU	COMET	BLEU	COMET	BLEU	COMET	Accuracy
ZeroMMT-600M	36.22 $\pm$ 0.40	83.04 $\pm$ 0.39	33.11 $\pm$ 0.68	82.54 $\pm$ 0.17	29.04 $\pm$ 0.13	77.72 $\pm$ 0.16	55.7 $\pm$ 0.3
w/o VMLM	37.17 $\pm$ 0.16	83.59 $\pm$ 0.08	33.72 $\pm$ 0.21	83.10 $\pm$ 0.09	28.14 $\pm$ 0.38	78.53 $\pm$ 0.21	50.0 $\pm$ 0.0
w/o KL	12.42 $\pm$ 5.76	67.10 $\pm$ 5.04	7.92 $\pm$ 4.36	64.92 $\pm$ 4.64	8.39 $\pm$ 4.17	61.32 $\pm$ 4.28	56.8 $\pm$ 1.1
+ MMT w/o KL	35.95 $\pm$ 0.51	82.58 $\pm$ 0.10	32.72 $\pm$ 0.31	82.02 $\pm$ 0.11	27.40 $\pm$ 0.33	77.13 $\pm$ 0.02	54.6 $\pm$ 0.6

Table 12: Ablation study En $\rightarrow$ De. The best result is in bold and the second best result is underlined.

	Test2016		Test2018		CoMMuTE
	BLEU	COMET	BLEU	COMET	Accuracy
ZeroMMT-600M	25.66 $\pm$ 0.43	84.27 $\pm$ 0.36	24.82 $\pm$ 0.49	83.32 $\pm$ 0.14	55.5 $\pm$ 0.5
w/o VMLM	26.49 $\pm$ 0.20	85.17 $\pm$ 0.07	26.55 $\pm$ 0.03	84.34 $\pm$ 0.07	50.1 $\pm$ 0.2
w/o KL	10.90 $\pm$ 5.49	71.97 $\pm$ 5.42	8.15 $\pm$ 4.14	67.27 $\pm$ 5.68	59.1 $\pm$ 0.8
+ MMT w/o KL	25.10 $\pm$ 0.27	83.78 $\pm$ 0.20	25.43 $\pm$ 0.10	82.66 $\pm$ 0.10	54.8 $\pm$ 0.9

Table 13: Ablation study En $\rightarrow$ Cs. The best result is in bold and the second best result is underlined.

	EMMT		VATEX		CoMMuTE
	BLEU	COMET	BLEU	COMET	Accuracy
ZeroMMT-600M	14.12 $\pm$ 0.09	52.39 $\pm$ 0.07	17.36 $\pm$ 0.13	61.82 $\pm$ 0.12	58.2 $\pm$ 1.1
w/o VMLM	15.19 $\pm$ 0.27	53.60 $\pm$ 0.11	17.40 $\pm$ 0.10	61.95 $\pm$ 0.12	50.1 $\pm$ 0.2
w/o KL	1.12 $\pm$ 4.86	43.30 $\pm$ 1.06	8.99 $\pm$ 4.27	51.01 $\pm$ 5.38	60.7 $\pm$ 0.5
+ MMT w/o KL	11.89 $\pm$ 0.44	51.56 $\pm$ 0.63	16.70 $\pm$ 0.18	62.16 $\pm$ 0.24	56.5 $\pm$ 0.5

Table 14: Ablation study En $\rightarrow$ Zh. The best result is in bold and the second best result is underlined.

Appendix B Additional examples

Figures 5(a), 5(b), 5(c), 5(d), 5(e) and 5(f) show additional translation examples from CoMMuTE by ZeroMMT (Ours) and the text-only NLLB-600M distilled model.