University of Modena and Reggio Emilia, via G. Campi 213/b, 41125, Modena, Italy. Email: [email protected]

Bidirectional Awareness Induction in Autoregressive Seq2Seq Models

Jia Cheng Hu (ORCID: 0009-0008-1611-966X), Roberto Cavicchioli (ORCID: 0000-0003-0166-0898), Alessandro Capotondi (ORCID: 0000-0001-8705-0761)

Abstract

Autoregressive Sequence-to-Sequence models are the foundation of many Deep Learning achievements in major research fields such as Vision and Natural Language Processing. Despite that, they still present significant limitations. For instance, when errors occur in the early steps of the prediction, the whole output is severely affected. This reliance on previously predicted tokens, together with the inherent computational unfriendliness of sequential algorithms, has motivated researchers to explore different architectures and methods in the search for bidirectional approaches. In this work, we introduce Bidirectional Awareness Induction (BAI), a training method that leverages a subset of elements in the network, the Pivots, to perform bidirectional learning without breaking the autoregressive constraints. To showcase its flexibility, we apply the method to three architectures, the Transformer, ExpansionNet v2 and GPT, and then perform experiments over three tasks. Experimental results demonstrate BAI’s effectiveness on all selected tasks and architectures. In particular, we observed an increase of up to 2.4 CIDEr in Image Captioning, 4.96 BLEU in Neural Machine Translation, and 1.16 ROUGE in Text Summarization compared to the respective baselines. Notably, BAI has a positive impact not only on models trained from scratch but on pre-trained models as well. This aspect, combined with the absence of architectural requirements, synergizes well with the current trend of LLMs.

Keywords: Autoregressive · Bidirectional · Sequence-to-Sequence.

1 Introduction

Many tasks in Natural Language Processing (NLP), such as Neural Machine Translation (NMT) [34, 7], Text Summarization (TS) [3, 2, 11] and Image Captioning (IC) [30, 8], deal with the challenge of generating meaningful and linguistically correct sentences. This is commonly accomplished by Neural Networks. Typically, models follow the autoregressive property, meaning that the token distribution predicted at time step $t$ depends on all the previous tokens from 1 to $t-1$. While this approach is intuitive, since the sequential process resembles, on a superficial level, how humans communicate, it poorly reflects how we process information and in fact presents some limitations. If errors occur in previous predictions, the quality of subsequent predictions is undermined. Additionally, unidirectional decoding fails to capture bidirectional contexts that can be exploited for more effective learning.

Several works in NLP-related fields such as IC and NMT have proposed approaches to combat the limitations of unidirectional decoding. Existing methods can be categorized into two main non-exclusive classes, which we name for simplicity "architecture-based" and "algorithm-based". The first [35, 34, 37, 39] consists of feeding Right-to-Left (R2L) data (or processing the input in reversed order) to the network in addition to the standard Left-to-Right (L2R) data, which often also implies architectural modifications. These methods lead to better performance at the expense of a higher computational cost. The second consists of training and algorithmic modifications [32, 6, 20, 26], which do not focus on the architecture but propose a different framework to predict multiple tokens simultaneously, often at the cost of the final accuracy. The very recent Text Diffusion models [18, 13] fall into this category. These methods focus on predicting the result in a single pass or in multiple parallel passes, in contrast to the standard autoregressive models.

Architecture-based strategies are very effective in integrating R2L processing into the model but typically require modifications to the architecture, which means that running the model is typically more time-consuming and that they cannot be easily extended to Large Pre-trained Models. The opposite is generally true (with some exceptions [26]) for algorithm-based strategies, where lowering the inference cost is the most desirable effect, at the expense of a negligible accuracy degradation. Finally, Text Diffusion models seem to be a promising direction that offers both low latency and satisfactory output quality. However, they still present limiting factors, such as a high test-training discrepancy [28] or the dependency on additional components such as length classifiers [18]. Overall, autoregressive decoding still represents the most popular and solid approach in modern applications.

In this work, we introduce a proposal that shares the same purpose as architecture-based methods, namely achieving better performance. However, it operates only during training and does not require architecture modifications, which can be ideal in the case of pre-trained models [21, 4, 3]. In particular, we observe that not all parts of a typical encoder-decoder network are subject to the auto-regressive constraint, and we introduce the concept of Pivots: elements in the network that are allowed to access and be trained on the entire target sequence. We leverage them to induce bidirectional awareness in the network.

Overall, the paper is organized as follows. First, we introduce the concept of pivots and Bidirectional Awareness Induction training. We showcase its application in three architectural instances. Then, we describe the experimental setup and showcase the results in three tasks. Afterwards, we compare BAI with other methods and analyze the quality of pivots. Finally, we discuss the limitations and present the conclusion and future works. The contributions are the following:

  1. We introduce the concept of Pivots, described as network elements on which we perform training on tasks that can be beneficial to, but are not directly related to, the final objective;

  2. Leveraging the concept of pivots, we introduce the Bidirectional Awareness Induction strategy, which trains pivots on a bidirectional loss without breaking the autoregressive property of the model, preserving the advantages of both worlds;

  3. We showcase our method’s flexibility and robustness over various architectures, tasks and training setups. Notably, our method works on both pre-trained models and models that are trained from scratch.

2 Related Works

Several approaches have been proposed over the past years to combat the limitations of unidirectional decoding, mostly in NMT and IC. Since the decoding stage in these two applications is similar and solutions are often interchangeable, in this Section we report related works in both fields.

In NMT, the works of [15] and [35] trained two models for L2R and R2L decoding respectively and combined their results at inference time; they proposed the joint search, an alternative to the beam search. [26] proposed the Bidirectional Beam Search. [32] implemented a Semi-Autoregressive architecture that decodes multiple tokens at each step. [36] proposed separate layers for the past and future representations in an RNN-based decoder. The work of [20] proposes an alternative decoding order that starts from the middle of the sequence, in contrast to the standard L2R and R2L decoding, whereas [38] completed the idea using the complementary approach of ordering the decoding stage from the sides to the middle. [34, 27] introduced the "Asynchronous" bidirectional decoding based on two RNNs, trained for L2R and R2L decoding respectively. First, one decoder produces the R2L sequence, then the second RNN leverages the first result during the generation of the L2R sequence. [37], in contrast, introduces the "Synchronous" version, in which both the L2R and R2L sequences are generated simultaneously. Another notable approach is represented by NAT [6, 7, 5], the Non-Autoregressive Architecture, whose main principle consists of replicating the source embeddings several times (according to the so-called "fertility") in the decoder, so that the latter can perform parallel and bidirectional processing.

In IC, the work of [31] first proposed the adoption of Bi-LSTM in the field. In [22] and [23] the authors adopted an auxiliary network to perform editing operations on the final result, which mitigates the limitations of autoregressive decoding. In CAAG [25] the authors propose two models: a primary network that generates a caption greedily, and a second one that leverages the first prediction to look at both the past and the future and performs a joint beam search. Both predictions are then combined in the generation of the final description. In CBTIC [39] the authors augment the Transformer decoder by designing a particular architecture that also integrates the R2L data into the network.

Text Diffusion models [18, 13] represent a recent and promising family of diffusion-based generative models for text. They are capable of performing bidirectional processing and prediction; however, they currently suffer from non-negligible issues, such as those related to the test-train discrepancy. Our work is orthogonal to these approaches and focuses on autoregressive models, which are still the predominant methodology in sequence-to-sequence problems.

Figure 1: Pivot selection in case of Transformer [29], ExpansionNet v2 [8] and GPT-2 [21] architectures. Pivots are highlighted in red colour, processing layers are depicted in blue.

3 Method

In this Section, we present the concept of Pivot and the main aspects of our Bidirectional Awareness Induction (BAI), from a conceptual point of view. Then, we apply it to several architectural instances.

3.1 Bidirectional Awareness Induction (BAI)

Our method, called Bidirectional Awareness Induction (BAI), is based on the concept of Pivot elements. We define Pivots as elements of the network that can be trained on tasks that are not necessarily correlated with the final objective function but can help improve the quality of the result. In this work, we train pivots to reproduce the target output in Seq2Seq problems, and we select them such that the auto-regressive property of the network is preserved. In this way, we induce bidirectional awareness in auto-regressive models or relax the prediction dependency on the previously generated tokens. Ultimately, BAI intends to combine the best of two worlds without inheriting the obstacles of bidirectional models. Overall, BAI can be broken down into three steps: (i) Pivot Selection: select a set of elements to be trained on the bidirectional task (e.g. encoder features). (ii) Length Equalization: leverage decoder representations only to equalize the length of the pivot sequence to that of the decoder sequence. (iii) Decoder Sequence Reconstruction: reconstruct the decoder sequence using exclusively the result of the previous step, and train the model to jointly optimize the reconstruction discrepancy and the task-specific auto-regressive loss. The encoder features typically represent the most straightforward selection of pivots, since they can be leveraged in the learning without breaking the auto-regressive condition. The concrete implementation of the strategy depends on the adopted architecture but, notably, BAI does not require architectural modifications.

We highlighted in the third step that the BAI loss function is intended to be jointly optimized with the Cross-Entropy training. In this way, the network can benefit from an increased bidirectional awareness, without harming the effectiveness of the traditional approach.

To showcase the flexibility of BAI, the following sections present its application to different architectures: the popular and established Transformer [29], the recent ExpansionNet v2 [8], and GPT-2 [21]. To study the robustness of the idea, we performed experiments on multiple tasks: Neural Machine Translation, Text Summarization, and Image Captioning.

3.2 BAI in Transformer

In this Section, we propose a concrete implementation of the previous idea in the case of the Transformer [29]. The Transformer is a popular Encoder-Decoder architecture that succeeded in numerous NLP tasks. In essence, all layers are made of Self-Attention and FeedForward layers and their exact implementation is omitted since they do not impact the discussion.

We select the output of the last encoder layer $\overline{E} = \{e_1, e_2, \ldots, e_N\}$ to be the pivots, which consist of $N$ feature vectors of size $H$. During the second step, length equalization, we compute the dot-product similarity between $\overline{E}$ and the target sequence embeddings $D \in \mathbb{R}^{M \times H}$, where $M$ is the target length. We then apply Softmax and multiply the result by $\overline{E}$:

$$R = \mathrm{Softmax}\left(\frac{D\,\overline{E}^{\intercal}}{\sqrt{H}}\right)\overline{E} \tag{1}$$

Note that, although it involves the embeddings $D$, the auto-regressive property is preserved. The final BAI loss is defined as the Mean Squared Error (MSE) between the result of Equation 1 and $D$:

$$\beta(R, D) = \frac{1}{M}\sum_{t=1}^{M}\frac{1}{H}(r_t - d_t)^{\intercal}(r_t - d_t) \tag{2}$$

where $r_t \in \mathbb{R}^{H \times 1}$ denotes the $t$-th element of $R \in \mathbb{R}^{M \times H}$, and $d_t \in \mathbb{R}^{H \times 1}$ denotes the $t$-th target embedding. During training, Equation 2 is jointly optimized with the standard Cross-Entropy (CE):

$$\mathrm{CE+BAI}(Y, X) = \lambda\,\beta(\mathrm{Embed}(Y), X) + \sum_{t=1}^{|M|} \log p(y_t \mid y_{i<t}, X) \tag{3}$$

where $X$ and $Y$ denote the input and target sequences respectively, $\mathrm{Embed}(\cdot)$ denotes the embedding function, and $\lambda$ is a configurable hyper-parameter.
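As an illustration of Equations 1 to 3, the following is a minimal NumPy sketch of the length equalization and the BAI loss, not the authors' implementation. The function names (`length_equalization`, `bai_loss`, `joint_loss`) are hypothetical, and the sketch assumes that the Cross-Entropy term enters the minimized objective as the negative log-likelihood of the target tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def length_equalization(D, E):
    """Equation 1: map N pivots E (N, H) onto M target positions using D (M, H)."""
    H = D.shape[-1]
    scores = D @ E.T / np.sqrt(H)          # (M, N) dot-product similarities
    return softmax(scores, axis=-1) @ E    # (M, H)

def bai_loss(R, D):
    """Equation 2: squared error averaged over the M positions and H features."""
    return float(np.mean((R - D) ** 2))

def joint_loss(target_log_probs, R, D, lam):
    """Equation 3 as a quantity to minimize.

    Assumption: the CE term is the negative sum of the log-probabilities
    assigned to the ground-truth tokens.
    """
    return lam * bai_loss(R, D) - float(np.sum(target_log_probs))

# Toy usage with random pivots and target embeddings.
rng = np.random.default_rng(0)
E = rng.normal(size=(7, 8))   # N = 7 pivots of size H = 8
D = rng.normal(size=(5, 8))   # M = 5 target embeddings
R = length_equalization(D, E)
```

Because the Softmax is taken over the pivot axis, each row of `R` is a convex combination of the pivots, so the reconstruction can only use information carried by the encoder output.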

3.3 BAI in ExpansionNet v2

ExpansionNet v2 [8] is another Encoder-Decoder architecture developed for Image Captioning. The encoder implements the (Block) Static Expansion, which distributes the input content across a group of arbitrary numbers of vectors, denoted by $G = \{g_1, g_2, \ldots, g_{|G|}\}$, in the Forward Expansion. The result is called the "expanded sequence" and is processed again in the Backward Expansion to retrieve the original sequence length. We refer to the original paper for the architectural details.

In BAI’s Pivot selection step, instead of directly adopting the last encoder outputs, we adopt two intermediate results. Let $A^g_l, B^g_l \in \mathbb{R}^{g \times H}$, for $g \in G$, be the results of the Forward Expansion in the two operation paths, where $g$ denotes the expanded sequence length and $l$ identifies the layer. In particular, we compute $\overline{A}^g = \sum_{l=1}^{L} A^g_l$ and $\overline{B}^g = \sum_{l=1}^{L} B^g_l$.

In the length equalization step we replicate a parameter-less version of the Backward Expansion:

$$
\begin{aligned}
R_1^g &= \phi\left(\mathrm{ReLU}\left(\frac{D\,((\overline{A}^g + \overline{B}^g)/2)^{\intercal}}{\sqrt{H}}\right)\right) \quad \text{for } g \in G \\
R_2^g &= \phi\left(\mathrm{ReLU}\left(-\frac{D\,((\overline{A}^g + \overline{B}^g)/2)^{\intercal}}{\sqrt{H}}\right)\right) \quad \text{for } g \in G \\
R_1 &= [R_1^{g_1}, R_1^{g_2}, \ldots, R_1^{g_{|G|}}]_2, \quad R_2 = [R_2^{g_1}, R_2^{g_2}, \ldots, R_2^{g_{|G|}}]_2 \\
\hat{A} &= [\overline{A}^{g_1}, \overline{A}^{g_2}, \ldots, \overline{A}^{g_{|G|}}]_1, \quad \hat{B} = [\overline{B}^{g_1}, \overline{B}^{g_2}, \ldots, \overline{B}^{g_{|G|}}]_1 \\
R &= \left(\frac{R_1\hat{A}}{|G|} + \frac{R_2\hat{B}}{|G|}\right)/2
\end{aligned} \tag{4}
$$

where $\phi$ denotes a row-wise normalization function [8] and $[\bullet, \bullet, \ldots, \bullet]_n$ denotes the concatenation over the $n$-th axis.

The third step of BAI is equivalent to the case described in Section 3.2 (Equation 2): we define the BAI loss as the MSE between the result of Equation 4 and $D$, and perform joint optimization with the Cross-Entropy.
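The shapes involved in Equation 4 can be checked with a small NumPy sketch. This is an illustration under stated assumptions rather than the ExpansionNet v2 code: the row-wise normalization $\phi$ is assumed to divide each row by its sum (with a small epsilon), the 1-indexed axes of the concatenation operator are mapped to NumPy axes 0 and 1, and all function names are hypothetical.

```python
import numpy as np

def row_normalize(X, eps=1e-9):
    # Assumed form of phi: divide every row by its (non-negative) sum.
    return X / (X.sum(axis=-1, keepdims=True) + eps)

def expansion_length_equalization(D, A_bar, B_bar):
    """Parameter-less sketch of Equation 4.

    D:     (M, H) target embeddings.
    A_bar: list with one (g, H) array per expansion length g in G.
    B_bar: list with the same shapes as A_bar.
    """
    H = D.shape[-1]
    R1_parts, R2_parts = [], []
    for A_g, B_g in zip(A_bar, B_bar):
        scores = D @ ((A_g + B_g) / 2).T / np.sqrt(H)             # (M, g)
        R1_parts.append(row_normalize(np.maximum(scores, 0.0)))   # positive path
        R2_parts.append(row_normalize(np.maximum(-scores, 0.0)))  # negative path
    R1 = np.concatenate(R1_parts, axis=1)   # (M, sum of g), second axis
    R2 = np.concatenate(R2_parts, axis=1)
    A_hat = np.concatenate(A_bar, axis=0)   # (sum of g, H), first axis
    B_hat = np.concatenate(B_bar, axis=0)
    num_groups = len(A_bar)                 # |G|
    return (R1 @ A_hat / num_groups + R2 @ B_hat / num_groups) / 2  # (M, H)

# Toy usage with G = {2, 4}, M = 3 and H = 8.
rng = np.random.default_rng(0)
D = rng.normal(size=(3, 8))
A_bar = [rng.normal(size=(g, 8)) for g in (2, 4)]
B_bar = [rng.normal(size=(g, 8)) for g in (2, 4)]
R = expansion_length_equalization(D, A_bar, B_bar)
```

The output has the same shape as $D$, so the MSE of Equation 2 can be applied to it unchanged.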

3.4 BAI in Large Pre-Trained Models

Most popular large pre-trained models are based on the Transformer architecture, such as Flan-T5 [3], mBART [17], and GPT-2 [21]. For this reason, BAI can be applied in the same way as described in Section 3.2 with few exceptions.

In the case of GPT-2, the architecture differs compared to the vanilla Transformer because of the absence of the encoder. In this case, we select the Pivot elements to be the final state of the input sequence.
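One possible reading of this selection is sketched below: the pivots are the last-layer hidden states at the positions of the input tokens. The toy `causal_self_attention` function is a weight-free stand-in for a GPT-2 layer and, like `select_pivots`, is a hypothetical name introduced only for illustration. Under a causal mask, these states do not depend on the target tokens that follow them.

```python
import numpy as np

def causal_self_attention(X):
    """Toy single-head causal self-attention without learned weights.

    X: (L, H) token states. Position i attends only to positions <= i.
    """
    L, H = X.shape
    scores = X @ X.T / np.sqrt(H)
    future = np.triu(np.ones((L, L), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

def select_pivots(hidden_states, num_input_tokens):
    # Assumed pivot selection: states at the input (prompt) positions only.
    return hidden_states[:num_input_tokens]

# Two sequences sharing the same input but with different targets.
rng = np.random.default_rng(0)
source = rng.normal(size=(4, 8))
target_a = rng.normal(size=(3, 8))
target_b = rng.normal(size=(3, 8))
pivots_a = select_pivots(causal_self_attention(np.concatenate([source, target_a])), 4)
pivots_b = select_pivots(causal_self_attention(np.concatenate([source, target_b])), 4)
```

In this toy setting `pivots_a` and `pivots_b` coincide, which is the property that allows the pivots to be trained on the whole target sequence through the BAI loss while next-token prediction remains causal.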

4 Experimental Setup

To evaluate the generalizability of our method, we evaluate BAI across three tasks and five architectures. For Neural Machine Translation (NMT) we adopt the Transformer [29] and mBART [17]. For Image Captioning (IC) we select ExpansionNet v2 [8] and the Transformer [29]. Finally, Flan-T5-Small [3] and GPT-2 [21] are evaluated on Text Summarization (TS). mBART, Flan-T5-Small, and GPT-2 are pre-trained according to the datasets and modalities reported in the respective works.

4.1 Dataset

A total of four datasets are adopted in the experiments. IC is evaluated on Microsoft’s COCO 2014 [14], which is split according to Karpathy [10]. Overall, it consists of 113K training images, 5K images for validation, and an additional 5K for testing. Each image is paired with five ground-truth captions. They are pre-processed by lowercasing and punctuation removal. Words that occur fewer than five times are discarded, resulting in a total of 10K unique tokens. NMT is tested on the IWSLT 2015 English-Vietnamese (En-Vi) corpus, which consists of 133K sentences for training and 1268 for evaluation. Sequences whose post-tokenization length is greater than 150 are discarded. In all training instances, the target and source language vocabularies are shared. Each vocabulary is created using the BPE algorithm [24]. TS is evaluated on the novel TIFU dataset [11], which consists of threads from the "TIFU" subreddit equipped with a "TL;DR" summary. We follow the Pegasus split [33]: the training set consists of $\sim$33K pairs, and 5K are reserved for testing. Additionally, we adopt the popular DialogSum [2], made of 12,460 training and 1500 testing dialogue-summary pairs. Since pre-trained models are adopted for this task, no additional pre-processing is applied to the datasets besides the subword tokenization defined in each respective model. We adopted two TS datasets because, in early experiments, some pre-trained baseline models produced only trivial answers on TIFU and were therefore unsuitable for experimentation and comparison. In contrast, all selected models achieved a satisfactory level of performance on DialogSum. In both datasets, training and testing samples are filtered according to the maximum sequence length supported by the pre-trained model.

Table 1: BAI augmented training results compared to the baselines, without BAI, on the MS-COCO test set. Metrics are denoted by M=METEOR, R=ROUGE, B=BLEU, S=SPICE and C=CIDEr-D. $\Delta$C denotes the difference in the main score CIDEr-D compared to the baseline "XE only" training.

| Model | Training | B1 | B2 | B3 | B4 | R | M | S | C | $\Delta$C |
|---|---|---|---|---|---|---|---|---|---|---|
| Transformer [29] | XE | 76.0 | 59.9 | 46.3 | 35.6 | 57.4 | 29.0 | 22.2 | 119.4 | 0.0 |
| Transformer | XE+BAI | 76.4 | 60.6 | 47.0 | 36.2 | 57.8 | 29.3 | 22.5 | 120.6 | +1.2 |
| ExpansionNet v2 [8] | XE | 77.6 | 62.0 | 48.2 | 37.2 | 58.1 | 29.4 | 22.5 | 123.5 | 0.0 |
| ExpansionNet v2 | XE+BAI | 78.2 | 63.0 | 49.2 | 38.1 | 58.5 | 29.5 | 22.8 | 125.9 | +2.4 |

Table 2: BAI augmented training results compared to the baselines on the TS test datasets. $\Delta$R denotes the difference in the main score ROUGE compared to the baseline (XE). All models are pre-trained according to the modalities of the respective papers and then fine-tuned on the benchmark dataset.

| Dataset | Model | Training | BLEU | ROUGE | $\Delta$R |
|---|---|---|---|---|---|
| TIFU | Flan-T5-Small [3] | XE | 20.56 | 32.29 | 0.0 |
| TIFU | Flan-T5-Small | XE+BAI | 21.96 | 33.45 | +1.16 |
| DialogSum | Flan-T5-Small [3] | XE | 46.38 | 33.93 | 0.0 |
| DialogSum | Flan-T5-Small | XE+BAI | 46.21 | 34.81 | +0.88 |
| DialogSum | GPT-2 [21] | XE | 39.99 | 24.74 | 0.0 |
| DialogSum | GPT-2 | XE+BAI | 39.50 | 25.09 | +0.35 |

4.2 Training and Models Details

The Transformer follows the architecture of the Base Transformer reported in [29], consisting of $N = 6$ encoder and decoder layers, dimension $H = 512$, FeedForward size $FF = 2048$ and 8 attention heads. ExpansionNet v2 follows the configuration of [8]: it consists of the Swin-Transformer-Large backbone [19] and $N = 3$ encoder and decoder layers. We adopt the Static Expansion coefficients $G = \{32, 64, 128, 256, 512\}$ in the encoder and a dynamic expansion coefficient of 16 in the decoder. The remaining hyper-parameters are the same as the Transformer. The Transformer will be adopted in the case of NMT and TS. ExpansionNet v2 will be deployed for IC. We refer to the original papers for the architecture details of Flan-T5-Small [3], mBART [17] and GPT-2 [21].

For all three problems, the standard Cross-Entropy (CE) loss is adopted. As described in Equation 3, we jointly train CE and BAI using a parameter $\lambda$ which weights the contribution of the BAI loss. For each task, we report the batch size, batching criteria, optimizer, learning rate strategy, and the hyper-parameters of the function $\Lambda: t \rightarrow [0, 1] \subset \mathbb{R}$, which provides the BAI weight $\lambda = \Lambda(t)$ at iteration $t$. We define $\Lambda$ as follows:

$$\Lambda(t) = \eta + \frac{1}{1 + e^{-(t/T - \phi)/\gamma}}(1 - \eta) \tag{5}$$

where $T$ denotes the number of iterations in one epoch, $\eta$ denotes the minimum weight, $\gamma$ determines the velocity of the weight increase and $\phi$ controls the base of the slope.
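The weight schedule of Equation 5 can be written as a short Python function. This is a sketch with a hypothetical function name (`bai_weight`); the sigmoid is evaluated in a numerically stable form so that a large offset, such as $\phi = 250$ combined with $\gamma = 0.1$, does not overflow at the first iterations.

```python
import math

def bai_weight(t, T, eta, gamma, phi):
    """Equation 5: BAI weight at iteration t.

    t:     current iteration.
    T:     number of iterations in one epoch, so t / T is the elapsed epochs.
    eta:   minimum weight.
    gamma: controls how quickly the weight increases.
    phi:   epoch at which the sigmoid factor equals 0.5.
    """
    z = (t / T - phi) / gamma
    if z >= 0:
        sigmoid = 1.0 / (1.0 + math.exp(-z))
    else:
        # Equivalent form that avoids overflow for very negative z.
        ez = math.exp(z)
        sigmoid = ez / (1.0 + ez)
    return eta + sigmoid * (1.0 - eta)

# Example: weight at the start of training with eta=1e-3, gamma=0.5, phi=15.
initial_weight = bai_weight(t=0, T=1000, eta=1e-3, gamma=0.5, phi=15)
```

With this parameterization the weight stays close to $\eta$ while $t/T$ is well below $\phi$, reaches $\eta + (1 - \eta)/2$ at $t/T = \phi$, and approaches 1 afterwards.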

IC Training. RAdam [16] optimizer with β1=0.9,β2=0.98formulae-sequencesubscript𝛽10.9subscript𝛽20.98\beta_{1}=0.9,\beta_{2}=0.98italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9 , italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.98, batch size of 48, random batching criteria, an initial learning rate of 2e-4, warmed up for 10000 iterations, then annealed by 0.8 every 2 epochs for 20 epochs. ΛΛ\Lambdaroman_Λ is defined by η𝜂\etaitalic_η=1e-3, γ𝛾\gammaitalic_γ=0.5 and ϕitalic-ϕ\phiitalic_ϕ=15 in the case of ExpansionNet v2. When reported, in the case of the Transformer is configured as η𝜂\etaitalic_η=1e-7, γ𝛾\gammaitalic_γ=2.0 and ϕitalic-ϕ\phiitalic_ϕ=24.

NMT Training. Adam optimizer [12] with β1=0.9subscript𝛽10.9\beta_{1}=0.9italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 0.9 and β2=0.98subscript𝛽20.98\beta_{2}=0.98italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 0.98, batching criteria based on the source sentence length, token batch size of 4096 and the Noam learning rate as described in [29] with 4000 warm-up steps, trained for 300 epochs. The BAI weight function ΛΛ\Lambdaroman_Λ is characterized by η𝜂\etaitalic_η=1e-12, γ𝛾\gammaitalic_γ=0.1 and ϕitalic-ϕ\phiitalic_ϕ=250 for models trained from scratch. Pre-trained models are fine-tuned for 3 epochs and ϕitalic-ϕ\phiitalic_ϕ is set to two.

TS Training. The same optimizer and learning rate for NMT is adopted. Trained for 2 epochs in TIFU and 10 epochs in DialogSum. A batch size of 2048 and no warming steps. ΛΛ\Lambdaroman_Λ is defined by η𝜂\etaitalic_η=1e-4, γ𝛾\gammaitalic_γ=0.5 and ϕitalic-ϕ\phiitalic_ϕ=0.5 for Flan-T5-Small. Whereas η𝜂\etaitalic_η=1e-12, γ𝛾\gammaitalic_γ=0.1 and ϕitalic-ϕ\phiitalic_ϕ=12 are the configurations for GPT-2.

While the general learning rate of the training follows the convention of the original papers, the definition of the $\Lambda$ function was chosen according to empirical rules formulated during the experimental stage. More details can be found in Section 5.3.

4.3 Evaluation Methods

In IC, the model is evaluated on the standard metrics CIDEr-D, METEOR, SPICE, ROUGE, and BLEU. In TS, we adopt the BLEU and ROUGE scores. For NMT, we use the BLEU score only. Beam Search is adopted during inference, with a beam width of 3 for IC and 4 for NMT and TS. In the case of pre-trained models, greedy decoding is performed.
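To make the role of the beam width concrete, the following is a simplified sketch of Beam Search over a toy next-token distribution. It is not the decoding code used in the experiments: the `toy_model` table, the token names, and the absence of length normalisation are all assumptions made for illustration. A beam width of 1 reduces to greedy decoding.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[float, List[str]]  # (cumulative log-probability, tokens)


def beam_search(next_log_probs: Callable[[List[str]], Dict[str, float]],
                beam_width: int, max_len: int, eos: str = "<eos>") -> List[str]:
    """Return the highest-scoring sequence found with the given beam width."""
    beams: List[Hypothesis] = [(0.0, [])]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every active hypothesis by every possible next token.
        candidates = [(score + logp, tokens + [tok])
                      for score, tokens in beams
                      for tok, logp in next_log_probs(tokens).items()]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        # Keep the best beam_width candidates; completed ones are set aside.
        for hyp in candidates[:beam_width]:
            (finished if hyp[1][-1] == eos else beams).append(hyp)
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution in which greedy decoding is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tuple(prefix), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_model, beam_width=1, max_len=5))  # greedy decoding
print(beam_search(toy_model, beam_width=2, max_len=5))
```

In this toy case greedy decoding commits to the locally best first token and ends with a sequence of probability 0.33, while a beam width of 2 recovers the sequence of probability 0.36.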

5 Results

In this Section, we assess the effectiveness of BAI across different tasks and architectures. We first present the impact of BAI on the Transformer and ExpansionNet v2 compared with standard Cross-Entropy training and with some related works. Then, we present additional analyses that assess the impact of BAI beyond standard benchmark metrics. Next, we discuss how the choice of the weight term affects the results of BAI. Finally, we showcase some qualitative results.

5.1 BAI Results

Tables 1, 2, and 3 report the performance improvements generated by our proposed method against the baselines, in IC, NMT, and TS.

Tables 1 and 2 report the improvements on the standard Image Captioning metrics and on Text Summarization, respectively. Both the Transformer and ExpansionNet v2 benefit from our method, with an increase of 1.2 CIDEr-D for the former and 2.4 CIDEr-D for the latter, while Flan-T5-Small and GPT-2 gain 0.88 and 0.35 ROUGE over their respective baselines. These results indicate the robustness and flexibility of our approach across architectures: the improvement can be seen regardless of the model and pivot selection. In Table 1, the improvements of the two architectures differ by 1.2 CIDEr-D, which suggests that the method is sensitive to the architecture or to the selection of pivots. However, this characteristic can be regarded as a strength: BAI proved effective on the most popular Seq2Seq model, the Transformer, and new architectures and strategies can be developed to benefit even more from our approach.

In Table 3, we report the difference in BLEU score for the NMT task. Augmenting the Cross-Entropy with BAI is beneficial in both situations: when the model is trained from scratch, represented by the base Transformer, and when BAI is applied during the fine-tuning of a large multilingual pre-trained model, mBART. This is also supported by Table 2, which shows performance improvements for the pre-trained models Flan-T5-Small and GPT-2.

Table 3: BAI augmented training results on the IWSLT15 En-Vi test set. $\Delta$BLEU denotes the difference compared to the respective Cross-Entropy-only result. The star symbol denotes that the model is pre-trained and fine-tuned; otherwise the model is trained from scratch.

| Model | Training | BLEU | $\Delta$BLEU |
|---|---|---|---|
| Transformer [29] | XE | 31.13 | 0.0 |
| Transformer | XE+BAI | 31.62 | +0.49 |
| mBART* [17] | XE | 44.37 | 0.0 |
| mBART* | XE+BAI | 49.33 | +4.96 |

In Table 3, the improvement is considerably larger for the pre-trained model than for the one trained from scratch. We hypothesize that this is because BAI training appears to be most effective when the weight term is low in the early stages of training and high in the final epochs. This situation is naturally recreated by the pre-training and fine-tuning of mBART, where BAI’s contribution is virtually zero during pre-training and is introduced only once the model has reached the performance plateau. More details can be found in Section 5.3.

Table 4: Impact of different training strategies using L2R and R2L data and the comparison against BAI.

| Training method | B1 | B4 | R | M | S | C |
|---|---|---|---|---|---|---|
| L2R XE (Baseline) | 77.1 | 36.6 | 58.0 | 29.4 | 22.6 | 122.7 |
| R2L XE | 77.0 | 36.0 | 57.7 | 29.0 | 20.9 | 122.5 |
| R2L XE then L2R XE | 75.4 | 35.2 | 57.2 | 22.5 | 29.3 | 120.1 |
| L2R+R2L XE | 76.0 | 35.5 | 57.3 | 22.4 | 29.1 | 119.3 |
| L2R XE + BAI | 78.2 | 38.1 | 58.5 | 29.5 | 22.8 | 125.9 |
| R2L XE + BAI | 78.0 | 37.7 | 58.1 | 29.3 | 22.8 | 124.3 |

5.2 BAI Against Other Bidirectionality Approaches

While early works on bidirectionality focused on recurrent architectures, in recent years most techniques have been developed for the Transformer in different applications. In this regard, we compare our proposal with NAT [6] (for NMT) and CBTIC [39] (for IC), as representatives of the algorithm-based and architecture-based approaches to bidirectionality, respectively. We train our models using the configurations reported in Section 4.2.

For completeness, we first showcase the ineffectiveness of inducing both Left-to-Right (L2R) and Right-to-Left (R2L) knowledge in the model in simple ways, i.e. without architectural or algorithmic incentives. As an example, we use the ExpansionNet v2 architecture for the image captioning task and adopt the training configuration described in Section 4.2. In Table 4, we observe that combining L2R and R2L data with simple strategies does not lead to improvements and can sometimes even harm performance. In contrast, BAI improves the output quality with both L2R and R2L training (training on R2L data leads to the artefacts described in [9] regardless of BAI), with an increase of 3.2 and 1.8 CIDEr-D in the two cases, respectively.
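As an illustration of what R2L data means here, the sketch below builds an R2L target from an L2R token sequence by reversing the content tokens. The special token names and the choice to keep them in place are assumptions for illustration; how the two orders are scheduled or mixed in each row of Table 4 is not detailed in this section.

```python
from typing import List


def to_r2l(tokens: List[str], bos: str = "<bos>", eos: str = "<eos>") -> List[str]:
    """Reverse the content tokens, keeping a leading <bos> and a trailing <eos> in place."""
    start = 1 if tokens and tokens[0] == bos else 0
    end = len(tokens) - 1 if tokens and tokens[-1] == eos else len(tokens)
    return tokens[:start] + tokens[start:end][::-1] + tokens[end:]


# Usage: one caption in both target orders.
l2r = ["<bos>", "a", "dog", "runs", "<eos>"]
r2l = to_r2l(l2r)
print(l2r)
print(r2l)
```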

In Table 5, we compare our method against the CBTIC architecture. In particular, we re-create the CBTIC environment by training captioning models on Faster-RCNN features, as described in [1]. Both the Transformer and ExpansionNet v2, abbreviated as "Transf." and "ExpV2" respectively, significantly benefit from BAI, confirming its robustness to different visual features. In the "ExpV2" case, BAI increases the initial score by 3.4 CIDEr-D, compared to the 2.7 increase observed for the CBTIC architecture. In the case of the Transformer, BAI led to an increase of 17.8. However, the magnitude of this improvement is an outlier that should not be attributed to BAI alone, but rather to an unfavourable training setup for the baseline, since the setup was designed for ExpansionNet v2.

Compared to CBTIC, BAI improves performance without requiring architectural changes, which is reflected in the constant amount of FLOPs with or without BAI. This aspect can be particularly appreciated in the case of pre-trained models, where architectural modifications can be very time-consuming since they would require the re-training of the foundation model.

In Table 6, we compare our proposal with NAT in the case of NMT. We re-create the experimental setup of the latter by training from scratch the Transformer architecture presented in Section 4.2 on the IWSLT16 En-De dataset. Our method improves the BLEU score by 0.3 without changing the model’s inference speed, whereas NAT, in its vanilla formulation, focuses on reducing the inference time at the cost of a slight degradation in performance.

Table 5: Comparison between BAI and CBTIC performances on the MS-COCO validation set. Models are trained with Cross-Entropy and Up-Down features [1]. $\delta$ denotes the CIDEr-D improvement against the respective baselines. FLOP computation assumes a caption length of 20 and 36 visual features, and includes the backbone.

| Method | B4 | R | M | C ($\delta$) | FLOPs |
|---|---|---|---|---|---|
| ExpV2 | 34.5 | 56.2 | 28.0 | 112.9 (0.0) | 8.8 G |
| ExpV2 w/ BAI | 35.3 | 56.8 | 28.2 | 116.3 (+3.4) | 8.8 G |
| Transf. | 28.1 | 51.6 | 25.1 | 95.7 (0.0) | 7.63 G |
| Transf. w/ BAI | 34.8 | 56.6 | 28.3 | 113.5 (+17.8) | 7.63 G |
| Transf. [39] | 35.4 | 56.3 | 27.8 | 111.7 (0.0) | 7.63 G |
| CBTIC [39] | 35.6 | 56.8 | 28.1 | 114.4 (+2.7) | 8.97 G |

Table 6: Comparison between BAI and NAT on the IWSLT16 En-De dataset. $\delta$ denotes the difference compared to the respective baseline value. An NVIDIA Tesla P100 was used in [6]. Our experiments are performed on an NVIDIA A100.

| Method | BLEU ($\delta$) | Latency |
|---|---|---|
| Transformer | 29.83 (0.0) | 71 ms |
| Transformer w/ BAI | 30.13 (+0.3) | 71 ms |
| Transformer [6] | 29.70 (0.0) | 607 ms |
| NAT [6] | 28.16 (-1.54) | 257 ms |
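The $\delta$ values in Tables 5 and 6 are plain differences against the respective baselines. The short check below recomputes them from the reported scores; the dictionary layout and key names are only an illustrative convention.

```python
# (baseline score, score of the compared method), copied from Tables 5 and 6.
table5_cider = {
    "ExpV2 w/ BAI": (112.9, 116.3),
    "Transf. w/ BAI": (95.7, 113.5),
    "CBTIC": (111.7, 114.4),
}
table6_bleu = {
    "Transformer w/ BAI": (29.83, 30.13),
    "NAT": (29.70, 28.16),
}


def delta(baseline: float, score: float) -> float:
    """Difference against the baseline, rounded to the precision used in the tables."""
    return round(score - baseline, 2)


for name, (baseline, score) in {**table5_cider, **table6_bleu}.items():
    print(f"{name}: {delta(baseline, score):+}")
```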

5.3 BAI Weight Term Impact

In this Section, we showcase the impact of the Weight Term choice on the effectiveness of BAI. Although we focused on the case of Image Captioning as the application example for the analysis, similar behaviours were observed in the other selected tasks.

In Figure 2 we plot the weight term function defined in Section 4.2 compared to other hand-crafted functions. The alternative weight functions were designed to represent orthogonal criteria such as constant weight, increasing weight, and decreasing weight.

In Figure 2 (a-c), it can be observed that BAI leads to different results depending on the weight term function. For instance, a large term that decreases over time ($\Lambda 5$) seems to introduce too much noise into the standard Cross-Entropy learning. However, the problem does not lie in the initial magnitude of the term, as can be seen in the contrary case, $\Lambda 4$, where the term increases over the epochs and improves the final CIDEr-D score. In general, the magnitude alone does not suggest a predictable pattern, since the cases of $\Lambda 1$, $\Lambda 2$, and $\Lambda 3$ produced mixed results. Overall, the effectiveness of BAI depends on the weight function, and it can be beneficial or detrimental depending on the design of the latter. Fortunately, in the case of $\Lambda^{*}$, we observed that keeping the term small in the early stages and increasing its magnitude when the model reaches the plateau in the Cross-Entropy loss (Figure 2.b) empirically leads to the best result. This motivated the design of the weight term function for all three tasks in Section 5.1.

Figure 2: BAI weight function impact on Image Captioning performances. a) Depiction of several strategies of weight term functions: $\Lambda 1(t) = 10^{-3}$, $\Lambda 2(t) = 10^{-6}$, $\Lambda 3(t) = 1$, $\Lambda 4(t) = 10^{-6} + (1 - 10^{-6}) \cdot (t/30)$, $\Lambda 5(t) = (1 - 10^{-6}) \cdot (30 - t)/30 + 10^{-6}$, and $\Lambda^{*}(t)$ is the function described in Section 4.2. $t$ denotes the epoch. b) Validation score of the baseline architecture (ExpansionNet v2) without BAI. The red line denotes the point where the model achieves the highest score. c) Best CIDEr-D score observed from different selections of $\Lambda$.
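The hand-crafted weight functions listed in the caption can be written down directly, as in the sketch below. For $\Lambda^{*}$ we assume the ExpansionNet v2 configuration of Section 4.2 ($\eta$=1e-3, $\gamma$=0.5, $\phi$=15) with $t$ expressed in epochs; the Python names are illustrative.

```python
import math

HORIZON = 30  # the constant 30 appearing in Lambda4 and Lambda5 (epochs)


def lambda1(t: float) -> float:
    return 1e-3  # constant, small


def lambda2(t: float) -> float:
    return 1e-6  # constant, very small


def lambda3(t: float) -> float:
    return 1.0  # constant, large


def lambda4(t: float) -> float:
    return 1e-6 + (1 - 1e-6) * (t / HORIZON)  # linearly increasing


def lambda5(t: float) -> float:
    return (1 - 1e-6) * (HORIZON - t) / HORIZON + 1e-6  # linearly decreasing


def lambda_star(t: float, eta: float = 1e-3, gamma: float = 0.5,
                phi: float = 15.0) -> float:
    """Sigmoid-shaped weight; parameter values are assumed, see the note above."""
    x = (t - phi) / gamma
    sig = 1 / (1 + math.exp(-x)) if x >= 0 else math.exp(x) / (1 + math.exp(x))
    return eta + sig * (1 - eta)


# Usage: compare the strategies at a few epochs.
for epoch in (0, 10, 15, 20, 30):
    row = [f(epoch) for f in (lambda1, lambda2, lambda3, lambda4, lambda5, lambda_star)]
    print(epoch, [f"{v:.4g}" for v in row])
```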

6 Conclusion and Future Works

In this work, we tackled the problem of integrating bidirectional awareness into autoregressive Seq2Seq models. To do so, we first introduced the concept of pivots, defined as network elements that can be trained on auxiliary losses that do not necessarily correlate with the loss required by the task but can help accomplish the latter. Leveraging this concept, we introduced the Bidirectional Awareness Induction (BAI), which trains pivots over a bidirectional loss without breaking the autoregressive property. The practice appears to increase the quality of the intermediate representations, and experimental results involving three architectures (the Transformer, ExpansionNet v2 and GPT) and three tasks (Neural Machine Translation, Text Summarization and Image Captioning) showcase our method’s robustness, effectiveness, and flexibility. Future experiments will focus on addressing the current limitations and on exploring the training of pivots with additional auxiliary losses, beyond the bidirectional one proposed in this work.

References

  • [1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6077–6086 (2018)
  • [2] Chen, Y., Liu, Y., Chen, L., Zhang, Y.: DialogSum: A real-life scenario dialogue summarization dataset. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 5062–5074. Association for Computational Linguistics, Online (Aug 2021). https://doi.org/10.18653/v1/2021.findings-acl.449, https://aclanthology.org/2021.findings-acl.449
  • [3] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
  • [4] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [5] Ding, L., Wu, D., Tao, D.: Improving neural machine translation by bidirectional training. arXiv preprint arXiv:2109.07780 (2021)
  • [6] Gu, J., Bradbury, J., Xiong, C., Li, V.O., Socher, R.: Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281 (2017)
  • [7] Gu, J., Kong, X.: Fully non-autoregressive neural machine translation: Tricks of the trade. arXiv preprint arXiv:2012.15833 (2020)
  • [8] Hu, J.C., Cavicchioli, R., Capotondi, A.: Exploiting multiple sequence lengths in fast end to end training for image captioning. In: 2023 IEEE International Conference on Big Data (BigData). pp. 2173–2182. IEEE Computer Society (2023)
  • [9] Hu, J.C., Cavicchioli, R., Capotondi, A.: A request for clarity over the end of sequence token in the self-critical sequence training. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds.) Image Analysis and Processing – ICIAP 2023. pp. 39–50. Springer Nature Switzerland, Cham (2023)
  • [10] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3128–3137 (2015)
  • [11] Kim, B., Kim, H., Kim, G.: Abstractive Summarization of Reddit Posts with Multi-level Memory Networks. In: NAACL-HLT (2019)
  • [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017)
  • [13] Li, X., Thickstun, J., Gulrajani, I., Liang, P.S., Hashimoto, T.B.: Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems 35, 4328–4343 (2022)
  • [14] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
  • [15] Liu, L., Utiyama, M., Finch, A., Sumita, E.: Agreement on target-bidirectional neural machine translation. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 411–416 (2016)
  • [16] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265 (2019)
  • [17] Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L.: Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742 (2020)
  • [18] Liu, Y., Yang, T., Huang, S., Zhang, Z., Huang, H., Wei, F., Deng, W., Sun, F., Zhang, Q.: Text diffusion with reinforced conditioning. arXiv preprint arXiv:2402.14843 (2024)
  • [19] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10012–10022 (2021)
  • [20] Mehri, S., Sigal, L.: Middle-out decoding. Advances in Neural Information Processing Systems 31 (2018)
  • [21] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019)
  • [22] Sammani, F., Elsayed, M.: Look and modify: Modification networks for image captioning. arXiv preprint arXiv:1909.03169 (2019)
  • [23] Sammani, F., Melas-Kyriazi, L.: Show, edit and tell: a framework for editing image captions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4808–4816 (2020)
  • [24] Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
  • [25] Song, Z., Zhou, X., Mao, Z., Tan, J.: Image captioning with context-aware auxiliary guidance. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 2584–2592 (2021)
  • [26] Sun, Q., Lee, S., Batra, D.: Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6961–6969 (2017)
  • [27] Tan, Z., Wang, S., Yang, Z., Chen, G., Huang, X., Sun, M., Liu, Y.: Neural machine translation: A review of methods, resources, and tools. AI Open 1, 5–21 (2020)
  • [28] Tang, Z., Wang, P., Zhou, K., Li, J., Cao, Z., Zhang, M.: Can diffusion model achieve better performance in text generation? bridging the gap between training and inference! (2023)
  • [29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
  • [30] Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3156–3164 (2015)
  • [31] Wang, C., Yang, H., Meinel, C.: Image captioning with deep bidirectional lstms and multi-task learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14(2s), 1–20 (2018)
  • [32] Wang, C., Zhang, J., Chen, H.: Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583 (2018)
  • [33] Zhang, J., Zhao, Y., Saleh, M., Liu, P.J.: Pegasus: Pre-training with extracted gap-sentences for abstractive summarization (2019)
  • [34] Zhang, X., Su, J., Qin, Y., Liu, Y., Ji, R., Wang, H.: Asynchronous bidirectional decoding for neural machine translation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
  • [35] Zhang, Z., Wu, S., Liu, S., Li, M., Zhou, M., Xu, T.: Regularizing neural machine translation by target-bidirectional agreement. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 443–450 (2019)
  • [36] Zheng, Z., Zhou, H., Huang, S., Mou, L., Dai, X., Chen, J., Tu, Z.: Modeling past and future for neural machine translation. Transactions of the Association for Computational Linguistics 6, 145–157 (2018)
  • [37] Zhou, L., Zhang, J., Zong, C.: Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics 7, 91–105 (2019)
  • [38] Zhou, L., Zhang, J., Zong, C., Yu, H.: Sequence generation: From both sides to the middle. arXiv preprint arXiv:1906.09601 (2019)
  • [39] Zhou, Y., Hu, Z., Liu, D., Ben, H., Wang, M.: Compact bidirectional transformer for image captioning. arXiv preprint arXiv:2201.01984 (2022)