University of Modena and Reggio Emilia, via G. Campi 213/b, 41125, Modena, Italy. Email: [email protected]

Bidirectional Awareness Induction in Autoregressive Seq2Seq Models

Jia Cheng Hu (ORCID 0009-0008-1611-966X), Roberto Cavicchioli (ORCID 0000-0003-0166-0898), Alessandro Capotondi (ORCID 0000-0001-8705-0761)

Abstract

Autoregressive Sequence-To-Sequence models are the foundation of many Deep Learning achievements in major research fields such as Vision and Natural Language Processing. Despite that, they still present significant limitations. For instance, when errors occur in the early steps of the prediction, the whole output is severely affected. Such reliance on previously predicted tokens, together with the inherent computational unfriendliness of sequential algorithms, has motivated researchers to explore different architectures and methods in the search for bidirectional approaches. In this work, we introduce Bidirectional Awareness Induction (BAI), a training method that leverages a subset of elements in the network, the Pivots, to perform bidirectional learning without breaking the autoregressive constraints. To showcase its flexibility, we apply the method to three architectures, the Transformer, ExpansionNet v2 and GPT, and perform experiments over three tasks. Experimental results showcase BAI's effectiveness on all selected tasks and architectures. In particular, we observed an increase of up to 2.4 CIDEr in Image Captioning, 4.96 BLEU in Neural Machine Translation, and 1.16 ROUGE in Text Summarization compared to the respective baselines. Notably, BAI has a positive impact not only on models trained from scratch but on pre-trained models as well. This aspect, combined with the absence of architectural requirements, synergizes well with the current trend of LLMs.

Keywords:

Autoregressive, Bidirectional, Sequence-to-Sequence.

1 Introduction

Many tasks in Natural Language Processing (NLP) such as Neural Machine Translation (NMT) [34, 7], Text Summarization (TS) [3, 2, 11] and Image Captioning (IC) [30, 8] deal with the challenge of generating meaningful and linguistically correct sentences. This is commonly accomplished by Neural Networks. Typically, models follow the Autoregressive property, meaning that the token distribution predicted at time step $t$ depends on all the previous tokens from 1 to $t-1$. While this approach is intuitive, as the sequential process resembles, on a superficial level, how humans communicate, it poorly reflects how we process information and, in fact, presents some limitations. If errors occur in previous predictions, the quality of subsequent predictions is undermined. Additionally, unidirectional decoding fails to capture bidirectional contexts that can be exploited for more effective learning.

Several works in NLP-related fields such as IC and NMT have proposed approaches to combat the limitations of unidirectional decoding. Existing methods can be categorized into two main non-exclusive classes, which we name for simplicity "architecture-based" and "algorithmic-based". The first [35, 34, 37, 39] consists of feeding Right-to-Left (R2L) data (or processing the input in reversed order) to the network, in addition to the standard Left-to-Right (L2R) data, which often also implies architectural modifications. These methods lead to better performance at the expense of a higher computational cost. The second consists of training and algorithmic modifications [32, 6, 20, 26] which do not focus on the architecture but propose a different framework to predict multiple tokens simultaneously, often at the cost of the final accuracy. The very recent Text Diffusion models [18, 13] fall into this category. These methods focus on predicting the result in a single pass or in multiple parallel passes, in contrast to the standard autoregressive models.

Architecture-based strategies are very effective in integrating R2L processing into the model, but they typically require architectural modifications, which make the model more time-consuming to run and hard to extend to Large Pre-trained Models. The opposite is generally true (with some exceptions [26]) for algorithmic-based strategies, where lowering the inference cost is the most desirable effect, at the expense of a negligible accuracy degradation. Finally, Text Diffusion models seem to be a promising direction that offers both low latency and satisfying output quality. However, they still present limiting factors, such as a high test-training discrepancy [28] or the dependency on additional components such as length classifiers [18]. Overall, autoregressive decoding still represents the most popular and solid approach in modern applications.

In this work, we introduce a proposal that shares the same purpose as architecture-based methods and aims at achieving better performance. However, it operates only during training and does not require architectural modifications, which is ideal in the case of pre-trained models [21, 4, 3]. In particular, we observe that not all parts of a typical encoder-decoder network are subject to the auto-regressive constraint, and we introduce the concept of Pivots. That is, there are elements in the network that are allowed to access and be trained on the entire target sequence, and we leverage them to induce bidirectional awareness in the network.

Overall, the paper is organized as follows. First, we introduce the concept of pivots and Bidirectional Awareness Induction training. We showcase its application in three architectural instances. Then, we describe the experimental setup and showcase the results in three tasks. Afterwards, we compare BAI with other methods and analyze the quality of pivots. Finally, we discuss the limitations and present the conclusion and future works. The contributions are the following:

1. We introduce the concept of Pivots, described as network elements that we train on tasks that can be beneficial to, but are not directly related to, the final objective.

2. Leveraging the concept of pivots, we introduce the Bidirectional Awareness Induction strategy, which trains pivots on a bidirectional loss without breaking the autoregressive property of the model, preserving the advantages of the two worlds.

3. We showcase our method's flexibility and robustness over various architectures, tasks and training setups. Notably, our method works on both pre-trained models and models that are trained from scratch.

2 Related Works

Several approaches have been proposed over the past years to combat the limitations of unidirectional decoding, mostly in NMT and IC. Since the decoding stage in these two applications is similar and solutions are often interchangeable, in this Section we report related works in both fields.

In NMT, the works of [15] and [35] trained two models for L2R and R2L decoding respectively and combined their results at inference time; they also proposed the joint search, an alternative to the beam search. [26] proposed the Bidirectional Beam Search. [32] implemented a Semi-Autoregressive architecture that decodes multiple tokens at each step. [36] proposed separate layers for the past and future representations in an RNN-based decoder. [20] proposed an alternative decoding order that starts from the middle of the sequence, in contrast to the standard L2R and R2L decoding, whereas [38] completed the idea using the complementary approach of ordering the decoding stage from the sides to the middle. [34, 27] introduced "Asynchronous" bidirectional decoding based on two RNNs, trained for L2R and R2L decoding respectively: first, one decoder produces the R2L sequence, then the second RNN leverages this result during the generation of the L2R sequence. In contrast, [37] introduced the "Synchronous" version, in which both the L2R and R2L sequences are generated simultaneously. Another notable approach is NAT [6, 7, 5], the Non-Autoregressive Architecture, whose main principle consists of replicating the source embeddings several times (according to the so-called "fertility") in the decoder, so that the latter can perform parallel and bidirectional processing.

In IC, the work of [31] first proposed the adoption of Bi-LSTM in the field. In [22] and [23] the authors adopted an auxiliary network to perform editing operations on the final result, which mitigates the limitations of autoregressive decoding. In CAAG [25] the authors propose two models: a primary network that generates a caption greedily, and a second one that leverages the first prediction to look at both the past and the future and performs a joint beam search. Both predictions are then combined in the generation of the final description. In CBTIC [39] the authors augment the Transformer decoder with a dedicated architecture that also integrates the R2L data into the network.

Text Diffusion models [18, 13] represent a recent and promising family of diffusion-based generative models for text. They are capable of performing bidirectional processing and prediction; however, they currently suffer from non-negligible issues, such as those related to the test-train discrepancy. Our work is orthogonal to these approaches and focuses on autoregressive models, which are still the predominant methodology in sequence-to-sequence problems.

Figure 1: Pivot selection in the case of the Transformer [29], ExpansionNet v2 [8] and GPT-2 [21] architectures. Pivots are highlighted in red, processing layers are depicted in blue.

3 Method

In this Section, we present the concept of Pivot and the main aspects of our Bidirectional Awareness Induction (BAI), from a conceptual point of view. Then, we apply it to several architectural instances.

3.1 Bidirectional Awareness Induction (BAI)

Our method, called Bidirectional Awareness Induction (BAI), is based on the concept of Pivot elements. We define Pivots as elements of the network that can be trained on tasks that are not necessarily correlated with the final objective function but can improve the quality of the result. In this work, we train pivots to reproduce the target output in Seq2Seq problems, and we select them such that the auto-regressive property of the network is preserved. In this way, we induce bidirectional awareness in auto-regressive models or relax the prediction dependency on the previously generated tokens. Ultimately, BAI intends to combine the best of the two worlds without inheriting the obstacles of bidirectional models. Overall, BAI can be broken down into three steps: (i) Pivot Selection. Select a set of elements to be trained on the bidirectional task (e.g. encoder features). (ii) Length Equalization. Using only the decoder representations, equalize the length of the pivot sequence to that of the decoder sequence. (iii) Decoder Sequence Reconstruction. Reconstruct the decoder sequence using exclusively the result of the previous step, and train the model to jointly optimize the reconstruction discrepancy and the task-specific auto-regressive loss. The encoder features typically represent the most straightforward selection of pivots, since they can be leveraged in the learning without breaking the auto-regression condition. The concrete implementation of the strategy depends on the adopted architecture but, notably, BAI does not require architectural modifications.

We highlighted in the third step that the BAI loss function is intended to be jointly optimized with the Cross-Entropy training. In this way, the network can benefit from an increased bidirectional awareness, without harming the effectiveness of the traditional approach.

In our experiments, to showcase the flexibility of BAI we present in the following sections the application of our proposal to different architectures, such as the popular and established Transformer [29], the recent ExpansionNet v2 [8], and GPT-2 [21]. To study the robustness of the idea we performed experiments in multiple tasks such as Neural Machine Translation, Text Summarization, and Image Captioning.

3.2 BAI in Transformer

In this Section, we propose a concrete implementation of the previous idea in the case of the Transformer [29]. The Transformer is a popular Encoder-Decoder architecture that succeeded in numerous NLP tasks. In essence, all layers are made of Self-Attention and FeedForward layers and their exact implementation is omitted since they do not impact the discussion.

We select the output of the last encoder layer $\overline{E}=\{e_{1},e_{2},\ldots,e_{N}\}$ to be the pivots, which consist of $N$ feature vectors of size $H$. During the second step, the length equalization step, we compute the dot product similarity between the target sequence embeddings $D\in\mathbb{R}^{M\times H}$, where $M$ is the target length, and $\overline{E}$. We then apply Softmax and multiply the result by $\overline{E}$:

$$R=\mathrm{Softmax}\left(\frac{D\overline{E}^{\intercal}}{\sqrt{H}}\right)\overline{E} \tag{1}$$

Note that, although it involves the embeddings $D$, the auto-regressive property is preserved. The final BAI loss is defined as the Mean Squared Error (MSE) between the result of Equation 1 and $D$:

$$\beta(R,D)=\frac{1}{M}\sum_{t=1}^{M}\frac{1}{H}(r_{t}-d_{t})^{\intercal}(r_{t}-d_{t}) \tag{2}$$

where $r_{t}\in\mathbb{R}^{H\times 1}$ denotes the $t$-th element of $R\in\mathbb{R}^{M\times H}$, and $d_{t}\in\mathbb{R}^{H\times 1}$ denotes the $t$-th target embedding. During training, the loss of Equation 2 is jointly optimized with the standard Cross-Entropy (CE):

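$$\mathrm{CE{+}BAI}(Y,X)=\lambda\,\beta(R,D)-\sum_{t=1}^{M}\log p(y_{t}\mid y_{<t},X) \tag{3}$$

where $X$ and $Y$ denote the input and target sequences respectively, $D=Embed(Y)$ with $Embed(\cdot)$ denoting the embedding function, $R$ is obtained from the encoding of $X$ through Equation 1, and $\lambda$ is a configurable hyper-parameter. The Cross-Entropy term is written as a negative log-likelihood, so that both terms are minimized.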
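As an illustration, Equations 1 to 3 can be sketched in plain Python on toy matrices. This is a minimal sketch rather than the authors' implementation: the helper names (`length_equalization`, `bai_loss`, `joint_loss`), the toy values of $\overline{E}$ and $D$, and the per-token log-probabilities are all hypothetical, and the Cross-Entropy is taken to be the negative log-likelihood of the target tokens.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Numerically stable softmax applied independently to each row."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def length_equalization(E, D):
    """Equation 1: map N pivots (N x H) to M rows using the target embeddings D (M x H)."""
    scale = math.sqrt(len(D[0]))
    scores = [[v / scale for v in row] for row in matmul(D, transpose(E))]  # M x N
    return matmul(softmax_rows(scores), E)  # M x H

def bai_loss(R, D):
    """Equation 2: squared error averaged over the M rows and H columns."""
    M, H = len(D), len(D[0])
    return sum((r - d) ** 2 for rr, dr in zip(R, D) for r, d in zip(rr, dr)) / (M * H)

def joint_loss(E, D, token_log_probs, lam):
    """Equation 3: weighted BAI term plus the negative log-likelihood."""
    cross_entropy = -sum(token_log_probs)
    return lam * bai_loss(length_equalization(E, D), D) + cross_entropy

# Hypothetical toy values: N = 3 pivots, M = 2 target embeddings, H = 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
D = [[0.5, 0.5], [1.0, 0.0]]
R = length_equalization(E, D)
loss = joint_loss(E, D, token_log_probs=[math.log(0.5), math.log(0.25)], lam=0.1)
```

Each row of `R` is a convex combination of the pivots, so the reconstruction can only succeed if the pivots themselves carry information about the whole target sequence, which is the intended training signal. The target embeddings enter only the auxiliary term and never the decoder's own inputs.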

3.3 BAI in ExpansionNet v2

ExpansionNet v2 [8] is another Encoder-Decoder architecture developed for Image Captioning. The encoder implements the (Block) Static Expansion, which distributes the input content across a group of arbitrary numbers of vectors, denoted by $G=\{g_{1},g_{2}\ldots,g_{|G|}\}$ , in the Forward Expansion. The result is called "expanded sequence" and is processed again in the Backward Expansion to retrieve the original sequence length. We refer to the original paper for the architectural details.

In BAI's Pivot selection step, instead of directly adopting the last encoder outputs, we adopt two intermediate results. Let $A^{g}_{l},B^{g}_{l}\in\mathbb{R}^{g\times H}$ be the results of the Forward Expansion in the two operation paths, where $g\in G$ denotes the expanded sequence length and $l$ identifies the layer. In particular, we compute $\overline{A}^{g}=\sum_{l=1}^{L}{A}^{g}_{l}$ and $\overline{B}^{g}=\sum_{l=1}^{L}{B}^{g}_{l}$.

In the length equalization step we replicate a parameter-less version of the Backward Expansion:

$$\begin{aligned}
R^{g}_{1} &= \phi\left(\mathrm{ReLU}\left(\frac{D\left((\overline{A}^{g}+\overline{B}^{g})/2\right)^{\intercal}}{\sqrt{H}}\right)\right) \quad \text{for } g\in G \\
R^{g}_{2} &= \phi\left(\mathrm{ReLU}\left(-\frac{D\left((\overline{A}^{g}+\overline{B}^{g})/2\right)^{\intercal}}{\sqrt{H}}\right)\right) \quad \text{for } g\in G \\
R_{1} &= [R_{1}^{g_{1}},R_{1}^{g_{2}},\ldots,R_{1}^{g_{|G|}}]_{2},\quad R_{2}=[R_{2}^{g_{1}},R_{2}^{g_{2}},\ldots,R_{2}^{g_{|G|}}]_{2} \\
\hat{A} &= [\overline{A}^{g_{1}},\overline{A}^{g_{2}},\ldots,\overline{A}^{g_{|G|}}]_{1},\quad \hat{B}=[\overline{B}^{g_{1}},\overline{B}^{g_{2}},\ldots,\overline{B}^{g_{|G|}}]_{1} \\
R &= \left(\frac{R_{1}\hat{A}}{|G|}+\frac{R_{2}\hat{B}}{|G|}\right)/2
\end{aligned} \tag{4}$$

where $\phi$ denotes a row-wise normalization function [8] and $[\bullet,\bullet,\ldots,\bullet]_{n}$ denotes the concatenation over the $n$ -th axis.

The third step of BAI is equivalent to the case described in Section 3.2: we define the BAI loss as the MSE (Equation 2) between the result of Equation 4 and $D$, and perform joint optimization with the Cross-Entropy.
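The parameter-less length equalization of Equation 4 can be sketched as follows. This is a hedged illustration, not the reference implementation: the function names and toy inputs are hypothetical, and the row-wise normalization $\phi$ is assumed here to divide each row by its sum plus a small constant `eps`; the exact definition should be taken from [8].

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def scaled_relu(scores, scale, sign):
    """ReLU applied to the (optionally negated) scores divided by sqrt(H)."""
    return [[max(0.0, sign * v / scale) for v in row] for row in scores]

def row_normalize(a, eps=1e-9):
    """Assumed form of phi: each row divided by its sum plus eps."""
    return [[v / (sum(row) + eps) for v in row] for row in a]

def concat_columns(parts, num_rows):
    """Concatenate matrices over the second axis."""
    return [[v for part in parts for v in part[m]] for m in range(num_rows)]

def equalize(A_bar, B_bar, D):
    """Equation 4. A_bar and B_bar map each g in G to a (g x H) matrix; D is (M x H)."""
    scale = math.sqrt(len(D[0]))
    groups = sorted(A_bar)
    R1_parts, R2_parts, A_hat, B_hat = [], [], [], []
    for g in groups:
        mean_ab = [[(a + b) / 2 for a, b in zip(ra, rb)]
                   for ra, rb in zip(A_bar[g], B_bar[g])]
        scores = matmul(D, transpose(mean_ab))  # M x g
        R1_parts.append(row_normalize(scaled_relu(scores, scale, 1.0)))
        R2_parts.append(row_normalize(scaled_relu(scores, scale, -1.0)))
        A_hat.extend(A_bar[g])  # concatenation over the first axis
        B_hat.extend(B_bar[g])
    R1 = concat_columns(R1_parts, len(D))  # M x sum(G)
    R2 = concat_columns(R2_parts, len(D))
    RA, RB = matmul(R1, A_hat), matmul(R2, B_hat)  # both M x H
    n = len(groups)
    return [[(a / n + b / n) / 2 for a, b in zip(ra, rb)] for ra, rb in zip(RA, RB)]

# Hypothetical toy values: G = {1, 2}, H = 2, M = 2.
A_bar = {1: [[1.0, 0.0]], 2: [[1.0, 0.0], [0.0, 1.0]]}
B_bar = {1: [[1.0, 0.0]], 2: [[1.0, 0.0], [0.0, 1.0]]}
D = [[1.0, 0.0], [0.0, 1.0]]
R = equalize(A_bar, B_bar, D)
```

The output `R` has the same shape as `D`, so the MSE of Equation 2 can be applied to it unchanged.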

3.4 BAI in Large Pre-Trained Models

Most popular large pre-trained models are based on the Transformer architecture, such as Flan-T5 [3], mBART [17], and GPT-2 [21]. For this reason, BAI can be applied in the same way as described in Section 3.2 with few exceptions.

In the case of GPT-2, the architecture differs compared to the vanilla Transformer because of the absence of the encoder. In this case, we select the Pivot elements to be the final state of the input sequence.

4 Experimental Setup

To assess the generalizability of our method, we evaluate BAI across three tasks and five architectures. For Neural Machine Translation (NMT) we adopt the Transformer [29] and mBART [17]. For Image Captioning (IC) we select ExpansionNet v2 [8] and the Transformer [29]. Finally, Flan-T5-Small [3] and GPT-2 [21] are evaluated on Text Summarization (TS). mBART, Flan-T5-Small, and GPT-2 are pre-trained according to the datasets and modalities reported in the respective works.

4.1 Dataset

A total of four datasets are adopted in the experiments. IC is evaluated on Microsoft COCO 2014 [14], which is split according to Karpathy [10]. Overall, it consists of 113K training images, 5K images for validation, and an additional 5K for testing. Each image is paired with five ground-truth captions, which are pre-processed by lowercasing and punctuation removal. Words that occur fewer than five times are discarded, resulting in a total of 10K unique tokens.

NMT is tested on the IWSLT 2015 English-Vietnamese (En-Vi) corpus, which consists of 133K sentences for training and 1268 for evaluation. Sequences whose post-tokenization length is greater than 150 are discarded. In all training instances, the target and the source language vocabulary are shared. Each vocabulary is created using the BPE algorithm [24].

TS is evaluated on the TIFU dataset [11], which consists of threads from the "TIFU" subreddit equipped with the "TL;DR" summary. We follow the Pegasus split [33]: the training set consists of $\sim$33K pairs, and 5K are reserved for testing. Additionally, we adopt the popular DialogSum [2], made of 12,460 training and 1500 testing dialogue-summary pairs. Since pre-trained models are adopted for these tasks, no additional pre-processing is applied to the datasets besides the subword tokenization defined in each respective model. We adopted two datasets because, in early experiments, some pre-trained baseline models produced only trivial answers on the TIFU dataset and were therefore unsuitable for experimentation and comparison. In contrast, all selected models achieved a satisfactory level of performance on DialogSum. In both datasets, training and testing samples are filtered according to the maximum sequence length supported by the pre-trained model.
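The caption pre-processing described for MS-COCO (lowercasing, punctuation removal, and discarding words that occur fewer than five times) can be sketched as below. This is only an illustrative sketch: the function names and the sample captions are hypothetical, punctuation is assumed to mean ASCII punctuation, and tokenization is assumed to be a plain whitespace split.

```python
import string
from collections import Counter

def preprocess(caption):
    """Lowercase a caption, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return caption.lower().translate(table).split()

def build_vocab(captions, min_count=5):
    """Keep the words that occur at least `min_count` times across all captions."""
    counts = Counter(tok for cap in captions for tok in preprocess(cap))
    return sorted(tok for tok, n in counts.items() if n >= min_count)

# Hypothetical captions: "cat" and "sleeps" occur only four times and are dropped.
captions = ["A dog runs."] * 5 + ["A cat sleeps!"] * 4
vocab = build_vocab(captions)
```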

Table 1: BAI-augmented training results compared to the baselines, without BAI, on the MS-COCO test set. Metrics are denoted by M=METEOR, R=ROUGE, B=BLEU, S=SPICE and C=CIDEr-D. $\Delta$C denotes the difference in the main score CIDEr-D compared to the baseline "XE only" training.

Model	Training	B1	B2	B3	B4	R	M	S	C	$\Delta$ C
Transformer [29]	XE	76.0	59.9	46.3	35.6	57.4	29.0	22.2	119.4	0.0
Transformer	XE+BAI	76.4	60.6	47.0	36.2	57.8	29.3	22.5	120.6	+1.2
ExpansionNet v2 [8]	XE	77.6	62.0	48.2	37.2	58.1	29.4	22.5	123.5	0.0
ExpansionNet v2	XE+BAI	78.2	63.0	49.2	38.1	58.5	29.5	22.8	125.9	+2.4

Table 2: BAI-augmented training results compared to the baselines on the TS test sets. $\Delta$R denotes the difference in the main score ROUGE compared to the baseline (XE). All models are pre-trained according to the modalities of the respective papers and then fine-tuned on the benchmark dataset.

Dataset	Model	Training	BLEU	ROUGE	$\Delta$ R
TIFU	Flan-T5-Small [3]	XE	20.56	32.29	0.0
TIFU	Flan-T5-Small	XE+BAI	21.96	33.45	+1.16
DialogSum	Flan-T5-Small [3]	XE	46.38	33.93	0.0
DialogSum	Flan-T5-Small	XE+BAI	46.21	34.81	+0.88
DialogSum	GPT-2 [21]	XE	39.99	24.74	0.0
DialogSum	GPT-2	XE+BAI	39.50	25.09	+0.35

4.2 Training and Models Details

The Transformer follows the architecture of the Base Transformer reported in [29], consisting of $N=6$ encoder and decoder layers, dimension $H=512$, FeedForward size of $FF=2048$ and $8$ attention heads. ExpansionNet v2 follows the configuration of [8]: it consists of the Swin-Transformer-Large backbone [19] and $N=3$ encoder and decoder layers. We adopt the static expansion coefficients $G=\{32,64,128,256,512\}$ in the encoder and a dynamic expansion coefficient of 16 in the decoder. The remaining hyper-parameters are the same as the Transformer. The Transformer is adopted for NMT and IC, whereas ExpansionNet v2 is deployed for IC. We refer to the original papers for the architecture details of Flan-T5-Small [3], mBART [17] and GPT-2 [21].

For all three problems, the standard Cross-Entropy (CE) loss is adopted. As described in Equation 3 we jointly train CE and BAI using a parameter $\lambda$ which weights the contribution of the BAI loss. For each task, we report the batch size, batching criteria, optimizer, learning rate strategy, and the hyper-parameters of the function $\Lambda:t\rightarrow[0,1]\subset\mathbb{R}$ which provides the BAI weight $\lambda=\Lambda(t)$ in the iteration $t$ . We define $\Lambda$ as follows:

$$\Lambda(t)=\eta+\frac{1}{1+e^{-(t/T-\phi)/\gamma}}(1-\eta) \tag{5}$$

where $T$ denotes the number of iterations in one epoch, $\eta$ denotes the minimum weight, $\gamma$ determines the velocity of the weight increase, and $\phi$ controls the position of the slope, expressed in epochs: the weight reaches the midpoint between $\eta$ and 1 when $t/T=\phi$.
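To make the behaviour of Equation 5 concrete, the weight function can be written as a few lines of Python. The sketch below is illustrative only: the function names and the epoch length `T = 1000` are hypothetical, and the sigmoid is evaluated in a numerically stable form because the NMT configuration ($\gamma$=0.1, $\phi$=250) would otherwise overflow `math.exp` at the start of training.

```python
import math

def stable_sigmoid(z):
    """Logistic function that avoids overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bai_weight(t, T, eta, gamma, phi):
    """Equation 5: ramp from eta towards 1, centred at epoch phi."""
    return eta + (1.0 - eta) * stable_sigmoid((t / T - phi) / gamma)

T = 1000  # hypothetical number of iterations per epoch

# IC configuration for ExpansionNet v2: eta=1e-3, gamma=0.5, phi=15.
start = bai_weight(0, T, eta=1e-3, gamma=0.5, phi=15)
midpoint = bai_weight(15 * T, T, eta=1e-3, gamma=0.5, phi=15)
late = bai_weight(20 * T, T, eta=1e-3, gamma=0.5, phi=15)
```

With these settings the weight stays close to $\eta$ for the first epochs, crosses the midpoint $\eta+(1-\eta)/2$ at epoch $\phi$, and saturates towards 1 afterwards; a smaller $\gamma$ makes the transition sharper.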

IC Training. RAdam [16] optimizer with $\beta_{1}=0.9,\beta_{2}=0.98$, batch size of 48, random batching criteria, an initial learning rate of 2e-4, warmed up for 10000 iterations, then annealed by 0.8 every 2 epochs for 20 epochs. $\Lambda$ is defined by $\eta$=1e-3, $\gamma$=0.5 and $\phi$=15 in the case of ExpansionNet v2. When the Transformer is reported, $\Lambda$ is configured as $\eta$=1e-7, $\gamma$=2.0 and $\phi$=24.

NMT Training. Adam optimizer [12] with $\beta_{1}=0.9$ and $\beta_{2}=0.98$ , batching criteria based on the source sentence length, token batch size of 4096 and the Noam learning rate as described in [29] with 4000 warm-up steps, trained for 300 epochs. The BAI weight function $\Lambda$ is characterized by $\eta$ =1e-12, $\gamma$ =0.1 and $\phi$ =250 for models trained from scratch. Pre-trained models are fine-tuned for 3 epochs and $\phi$ is set to two.
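For reference, the Noam learning rate of [29] increases linearly during the warm-up steps and then decays with the inverse square root of the step number. The sketch below uses the formula from [29] with $H=512$ and 4000 warm-up steps as stated above; the function name is hypothetical, and any additional scaling factor used in the actual experiments is not specified in the text and is therefore omitted.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Noam schedule: d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks at the end of the warm-up and decays afterwards.
peak = noam_lr(4000)
early = noam_lr(100)
after = noam_lr(16000)
```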

TS Training. The same optimizer and learning rate strategy as in NMT are adopted, with a batch size of 2048 and no warm-up steps. Models are trained for 2 epochs on TIFU and 10 epochs on DialogSum. $\Lambda$ is defined by $\eta$=1e-4, $\gamma$=0.5 and $\phi$=0.5 for Flan-T5-Small, whereas $\eta$=1e-12, $\gamma$=0.1 and $\phi$=12 are the configurations for GPT-2.

While the general learning rate of the training follows the convention of the original papers, the definition of the $\Lambda$ function was chosen according to empirical rules formulated during the experimental stage. More details can be found in Section 5.3.

4.3 Evaluation Methods

In IC, the model is evaluated on the standard metrics CIDEr-D, METEOR, SPICE, ROUGE, and BLEU. In TS we adopt the BLEU and ROUGE scores. For NMT we use the BLEU score only. Beam Search is adopted during inference, with a beam width of 3 in the case of IC, and 4 in the case of NMT and TS. In the case of pre-trained models, greedy decoding is performed.

5 Results

In this Section, we examine the effectiveness of BAI across different tasks and architectures. We first present the impact of BAI on the Transformer and ExpansionNet v2 when compared with the standard Cross-Entropy training and some related works. Then, we provide additional analysis to assess the impact of BAI beyond standard benchmark metrics. Next, we discuss the impact of the weight term on the results of BAI. Finally, we showcase some qualitative results.

5.1 BAI Results

Tables 1, 2, and 3 report the performance improvements generated by our proposed method against the baselines in IC, TS, and NMT, respectively.

Tables 1 and 2 showcase the performance improvements across standard Image Captioning and Text Summarization metrics. It can be observed that both the Transformer and ExpansionNet v2 benefit from our method, with an increase of 1.2 CIDEr-D in the former and 2.4 CIDEr-D in the latter. On DialogSum, Flan-T5-Small and GPT-2 reported an increase of 0.88 and 0.35 ROUGE compared to the respective baselines. These results indicate the robustness and flexibility of our approach across different architectures; in particular, the improvement can be seen regardless of the model and pivot selection. In Table 1, a difference of 1.2 CIDEr-D in the improvements can be observed between the two architectures. This suggests that the method is sensitive to the architecture or to the selection of pivots. However, this characteristic can be regarded as a strength: BAI proved to be effective in the case of the most popular Seq2Seq model, the Transformer, and new architectures and strategies can be developed to benefit more from our approach.

In Table 3 we report the difference in BLEU score for the NMT task. It can be observed that augmenting the Cross-Entropy with BAI is beneficial in both situations: when the model is trained from scratch, represented by the base Transformer, and when BAI is applied during the fine-tuning of a multi-lingual large pre-trained model, mBART. This aspect is also supported by Table 2, which showcases performance improvements in the case of the pre-trained models Flan-T5-Small and GPT-2.

Table 3: BAI-augmented training results on the IWSLT15 En-Vi test set. $\Delta$BLEU denotes the difference compared to the respective Cross-Entropy-only result. The star symbol ^⋆ denotes that the model is pre-trained and fine-tuned; otherwise the model is trained from scratch.

Model	Training	BLEU	$\Delta$ BLEU
Transformer [29]	XE	31.13	0.0
Transformer	XE+BAI	31.62	+0.49
mBART^⋆ [17]	XE	44.37	0.0
mBART^⋆	XE+BAI	49.33	+4.96

In Table 3, a significant difference in the improvements can be noted between the pre-trained model and the one trained from scratch. We hypothesize that this is because BAI training appears to be most effective when the weight term is low in the early stages of training but high in the final epochs. This situation is naturally recreated by the pre-training and fine-tuning of mBART, where BAI's contribution is zero during pre-training and is introduced only once the model has reached the performance plateau. More details can be found in Section 5.3.

Table 4: Impact of different training strategies using L2R and R2L data and the comparison against BAI.

Training method	B1	B4	R	M	S	C
L2R XE (Baseline)	77.1	36.6	58.0	29.4	22.6	122.7
R2L XE	77.0	36.0	57.7	29.0	20.9	122.5
R2L XE then L2R XE	75.4	35.2	57.2	22.5	29.3	120.1
L2R+R2L XE	76.0	35.5	57.3	22.4	29.1	119.3
L2R XE + BAI	78.2	38.1	58.5	29.5	22.8	125.9
R2L XE + BAI	78.0	37.7	58.1	29.3	22.8	124.3

5.2 BAI Against Other Bidirectionality Approaches

While early works on bidirectionality focused on recurrent architectures, in recent years most techniques have been developed for the Transformer in different applications. In this regard, we compare our proposal to NAT [6] (for NMT) and CBTIC [39] (for IC) as representatives of the algorithmic-based and architecture-based approaches to bidirectionality, respectively. We train our models using the configurations reported in Section 4.2.

For completeness, we first showcase the ineffectiveness of inducing both Left-to-Right (L2R) and Right-to-Left (R2L) knowledge in the model in simple ways, i.e. without architectural or algorithmic incentives. As an example, we use the ExpansionNet v2 architecture for the image captioning task and adopt the training configuration described in Section 4.2. In Table 4 we observe that combining L2R and R2L data with simple strategies does not lead to improvements and can sometimes even harm performance. In contrast, BAI improves the output quality regardless of whether L2R or R2L¹ training is used, with an increase of 3.2 and 1.8 CIDEr-D in the two cases.

¹ Training on R2L data leads to artefacts described in [9] regardless of BAI.

In Table 5, we compare our method against the CBTIC architecture. In particular, we re-create the CBTIC environment by training captioning models on Faster-RCNN features as described in [1]. Both the Transformer and ExpansionNet v2, shortened respectively as "Transf." and "ExpV2", significantly benefit from BAI, confirming its robustness to different visual features. In particular, in the "ExpV2" case, BAI increases the initial score by 3.4 CIDEr-D, compared to the 2.7 increase observed in the CBTIC architecture. In the case of the Transformer, BAI led to an increase of 17.8. However, the magnitude of this improvement is an outlier that should not be attributed to BAI alone, but rather to an unfavourable training setup for the baseline, since the setup was designed for ExpansionNet v2.

Compared to CBTIC, BAI improves performance without requiring architectural changes, which is reflected in the constant amount of FLOPs with or without BAI. This aspect can be particularly appreciated in the case of pre-trained models, where architectural modifications can be very time-consuming since they would require the re-training of the foundation model.

In Table 6, we showcase the comparison between our proposal and NAT in the case of NMT. We re-create the experimental setup of the latter by training from scratch the Transformer architecture presented in Section 4.2 on the IWSLT16 En-De dataset. Here, it can be observed that our method improves the BLEU score by 0.3 without changing the model inference speed, whereas NAT, in its vanilla formulation, focuses on reducing the inference time at the cost of a slight degradation in performance.

Table 5: Comparison between BAI and CBTIC performances on the MS-COCO validation set. Models are trained with Cross-Entropy and Up-Down features [1]. $\delta$ denotes the CIDEr-D improvement against the respective baselines. FLOP computation assumes a caption length of 20 and 36 visual features include the backbone.

Method	B4	R	M	C ( $\delta$ )	FLOPs
ExpV2	34.5	56.2	28.0	112.9 (0.0)	8.8 G
ExpV2 w/ BAI	35.3	56.8	28.2	116.3 (+3.4)	8.8 G
Transf.	28.1	51.6	25.1	95.7 (0.0)	7.63 G
Transf. w/ BAI	34.8	56.6	28.3	113.5 (+17.8)	7.63 G
Transf. [39]	35.4	56.3	27.8	111.7 (0.0)	7.63 G
CBTIC [39]	35.6	56.8	28.1	114.4 (+2.7)	8.97 G

Table 6: Comparison between BAI and NAT on IWSLT16 En-De. $\delta$ denotes the difference compared to the respective baseline value. An NVIDIA Tesla P100 was used in [6]. Our experiments are performed on an NVIDIA A100.

Method	BLEU $(\delta)$	Latency
Transformer	29.83 (0.0)	71 ms
Transformer w/ BAI	30.13 (+0.3)	71 ms
Transformer [6]	29.70 (0.0)	607 ms
NAT [6]	28.16 (-1.54)	257 ms

5.3 BAI Weight Term Impact

In this Section, we showcase the impact of the Weight Term choice on the effectiveness of BAI. Although we focused on the case of Image Captioning as the application example for the analysis, similar behaviours were observed in the other selected tasks.

In Figure 2 we plot the weight term function defined in Section 4.2 compared to other hand-crafted functions. The alternative weight functions were designed to represent orthogonal criteria such as constant weight, increasing weight, and decreasing weight.

In Figure 2 (a-c) it can be observed that BAI leads to different results according to the weight term function. For instance, a large term that decreases over time ($\Lambda$5) seems to introduce too much noise into the standard Cross-Entropy learning. However, the problem does not lie in the final magnitude of the term, as can be seen in the contrary case, $\Lambda$4, where the term increases over the epochs and improves the final CIDEr-D score. In general, the magnitude alone does not suggest a predictable pattern, since the cases of $\Lambda$1, $\Lambda$2, and $\Lambda$3 produced mixed results. Overall, the effectiveness of BAI depends on the weight function, and it can be beneficial or detrimental depending on the design of the latter. In the case of $\Lambda^{*}$, we observed that keeping the term small in the early stages and increasing its magnitude when the model reaches the plateau in the Cross-Entropy loss (Figure 2-b) empirically leads to the best result. This motivated the design of the weight term function for all three tasks in Section 5.1.

6 Conclusion and Future Works

In this work, we tackled the problem of integrating bidirectional awareness into auto-regressive Seq2Seq models. To do so, we first introduced the concept of pivots, defined as network elements that can be trained on auxiliary losses that do not necessarily correlate with the loss required by the task but can help accomplish it better. Leveraging this concept, we introduced Bidirectional Awareness Induction (BAI), which trains pivots over a bidirectional loss without breaking the auto-regressive property. The practice appears to increase the quality of the intermediate representations, and experimental results involving three architectures (the Transformer, ExpansionNet v2, and GPT) and three tasks (Neural Machine Translation, Text Summarization, and Image Captioning) showcase our method's robustness, effectiveness, and flexibility. Future experiments will focus on addressing the current limitations and on exploring the training of pivots with additional auxiliary losses beyond the bidirectional one proposed in this work.

References

[1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6077–6086 (2018)
[2] Chen, Y., Liu, Y., Chen, L., Zhang, Y.: DialogSum: A real-life scenario dialogue summarization dataset. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 5062–5074. Association for Computational Linguistics, Online (Aug 2021). https://doi.org/10.18653/v1/2021.findings-acl.449, https://aclanthology.org/2021.findings-acl.449
[3] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
[4] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[5] Ding, L., Wu, D., Tao, D.: Improving neural machine translation by bidirectional training. arXiv preprint arXiv:2109.07780 (2021)
[6] Gu, J., Bradbury, J., Xiong, C., Li, V.O., Socher, R.: Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281 (2017)
[7] Gu, J., Kong, X.: Fully non-autoregressive neural machine translation: Tricks of the trade. arXiv preprint arXiv:2012.15833 (2020)
[8] Hu, J.C., Cavicchioli, R., Capotondi, A.: Exploiting multiple sequence lengths in fast end to end training for image captioning. In: 2023 IEEE International Conference on Big Data (BigData). pp. 2173–2182. IEEE Computer Society (2023)
[9] Hu, J.C., Cavicchioli, R., Capotondi, A.: A request for clarity over the end of sequence token in the self-critical sequence training. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds.) Image Analysis and Processing – ICIAP 2023. pp. 39–50. Springer Nature Switzerland, Cham (2023)
[10] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3128–3137 (2015)
[11] Kim, B., Kim, H., Kim, G.: Abstractive Summarization of Reddit Posts with Multi-level Memory Networks. In: NAACL-HLT (2019)
[12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017)
[13] Li, X., Thickstun, J., Gulrajani, I., Liang, P.S., Hashimoto, T.B.: Diffusion-LM improves controllable text generation. Advances in Neural Information Processing Systems 35, 4328–4343 (2022)
[14] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
[15] Liu, L., Utiyama, M., Finch, A., Sumita, E.: Agreement on target-bidirectional neural machine translation. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 411–416 (2016)
[16] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265 (2019)
[17] Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L.: Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742 (2020)
[18] Liu, Y., Yang, T., Huang, S., Zhang, Z., Huang, H., Wei, F., Deng, W., Sun, F., Zhang, Q.: Text diffusion with reinforced conditioning. arXiv preprint arXiv:2402.14843 (2024)
[19] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022 (2021)
[20] Mehri, S., Sigal, L.: Middle-out decoding. Advances in Neural Information Processing Systems 31 (2018)
[21] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
[22] Sammani, F., Elsayed, M.: Look and modify: Modification networks for image captioning. arXiv preprint arXiv:1909.03169 (2019)
[23] Sammani, F., Melas-Kyriazi, L.: Show, edit and tell: a framework for editing image captions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4808–4816 (2020)
[24] Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
[25] Song, Z., Zhou, X., Mao, Z., Tan, J.: Image captioning with context-aware auxiliary guidance. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 2584–2592 (2021)
[26] Sun, Q., Lee, S., Batra, D.: Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6961–6969 (2017)
[27] Tan, Z., Wang, S., Yang, Z., Chen, G., Huang, X., Sun, M., Liu, Y.: Neural machine translation: A review of methods, resources, and tools. AI Open 1, 5–21 (2020)
[28] Tang, Z., Wang, P., Zhou, K., Li, J., Cao, Z., Zhang, M.: Can diffusion model achieve better performance in text generation? bridging the gap between training and inference! (2023)
[29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
[30] Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3156–3164 (2015)
[31] Wang, C., Yang, H., Meinel, C.: Image captioning with deep bidirectional LSTMs and multi-task learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14(2s), 1–20 (2018)
[32] Wang, C., Zhang, J., Chen, H.: Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583 (2018)
[33] Zhang, J., Zhao, Y., Saleh, M., Liu, P.J.: PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization (2019)
[34] Zhang, X., Su, J., Qin, Y., Liu, Y., Ji, R., Wang, H.: Asynchronous bidirectional decoding for neural machine translation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
[35] Zhang, Z., Wu, S., Liu, S., Li, M., Zhou, M., Xu, T.: Regularizing neural machine translation by target-bidirectional agreement. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 443–450 (2019)
[36] Zheng, Z., Zhou, H., Huang, S., Mou, L., Dai, X., Chen, J., Tu, Z.: Modeling past and future for neural machine translation. Transactions of the Association for Computational Linguistics 6, 145–157 (2018)
[37] Zhou, L., Zhang, J., Zong, C.: Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics 7, 91–105 (2019)
[38] Zhou, L., Zhang, J., Zong, C., Yu, H.: Sequence generation: From both sides to the middle. arXiv preprint arXiv:1906.09601 (2019)
[39] Zhou, Y., Hu, Z., Liu, D., Ben, H., Wang, M.: Compact bidirectional transformer for image captioning. arXiv preprint arXiv:2201.01984 (2022)