\history

Date of publication 28 June 2024. Digital Object Identifier 10.1109/ACCESS.2024.3420710

\tfootnote

This work was supported in part by Xunta de Galicia grants ED481B-2021-118, ED481B-2022-093, and ED431C 2022/04, Spain; Ministerio de Educación grant FPU21/00798, Spain; Ministerio de Ciencia e Innovación grant TED2021-130824B-C21, Spain; and by the Galician Supercomputing Center (CESGA), Spain, which provided the computing resources required.

\corresp

Corresponding author: Andrea Busto-Castiñeira (e-mail: [email protected]).

Predictability and Causality in Spanish and English Natural Language Generation

ANDREA BUSTO-CASTIÑEIRA$^{1}$, FRANCISCO J. GONZÁLEZ-CASTAÑO$^{1}$, SILVIA GARCÍA-MÉNDEZ$^{1}$, and FRANCISCO DE ARRIBA-PÉREZ$^{1}$

$^{1}$Information Technologies Group, atlanTTic, Telecommunication Engineering School, University of Vigo, 36310 Vigo, Spain

Abstract

In recent years, the field of Natural Language Generation (NLG) has been boosted by advances in deep learning technologies. Nonetheless, these new data-intensive methods introduce language-dependent disparities in NLG, as the main training data sets are in English. Also, most neural NLG systems use decoder-only (causal) transformer language models, which work well for English but were not designed with other languages in mind. In this work, we start from the hypothesis that they may introduce generation bias in target languages with less rigid word ordering, subject omission, or different attachment preferences for relative clauses, so that other language generation strategies may be more desirable for these target languages.

This paper first compares causal and non-causal language modeling for English and Spanish, two languages with different grammatical structures and over 1.5 billion and 0.5 billion speakers, respectively. For this purpose, we define a novel metric of average causal and non-causal context-conditioned entropy of the grammatical category distribution for both languages as an information-theoretic a priori approach.

The evaluation of natural text sources (such as training data) in both languages reveals lower average non-causal conditional entropy in Spanish and lower causal conditional entropy in English. According to this experiment, Spanish is more predictable than English given a non-causal context. Then, by applying a conditional relative entropy metric to text generation experiments, we find that the best performance is achieved with causal NLG in English and with non-causal NLG in Spanish. These insights support further research in NLG in Spanish using bidirectional transformer language models.

Index Terms:
Language predictability, natural language generation, non-causal language modeling, Spanish language, transformer language models.

I Introduction

Thanks to their capacity to acquire universal language representations from vast amounts of unlabeled text data, transformer-based Natural Language Generation (NLG) models [1] have achieved unprecedented success [2].

Transformers build on Sequence-to-Sequence (seq2seq) language models [3], enhancing them with positional encoding, which enables parallel training while still considering word order, and a novel self-attention mechanism that selects the most relevant parts of the input sequence. The unsupervised nature of transformers’ pre-training facilitates handling vast amounts of raw text data.

However, the linguistic prevalence of English on the Internet is a primary source of bias [4]. Most evaluation data sets and benchmarks are primarily or entirely written in English, with very few exceptions [5], and therefore the most innovative contributions rarely target other languages. This linguistic data imbalance has a detrimental effect on polyglot language models’ word embeddings [6, 7], as tokenizers assign longer tokens to character sequences in languages with more extensive representation [8, 9, 10], despite some proposed solutions such as those in [11] and [12]. The multilingual NLP paradigm of cross-lingual transfer learning, which tends to view non-English languages as particular use cases, makes language models favor English-like grammars with stricter word ordering and explicit subjects [13].

Nowadays, the majority of generative language models are decoder-only transformers. OpenAI’s GPT-3/3.5/4 [14] and ChatGPT are the most popular, and Big Science’s BLOOM, Meta AI’s OPT, and Google AI’s BARD are also well-known examples. Most of these models are multilingual; however, their performance varies between languages. This led to monolingual implementations of the smaller model GPT-2 in languages other than English, for example in German and French (available at: https://cedille.ai, June 2024), Italian [15], and Spanish [16].

These generative models are exclusively causal, that is, they produce text from left to right by recursively feeding the model with previously generated sequences. As in the case of Recurrent Neural Networks (RNN), decoder-only transformers are expectation-based word predictors. These systems tend to favor structures in which related elements are close along the sequence, such as relative clause attachments to syntactically lower nominals in ambiguous contexts, which fits nicely into English syntax [17].

However, the mutually beneficial congruence between causal language modeling and English may not apply to other languages. Not only does Spanish prefer a higher nominal attachment in the resolution of ambiguous relative clauses, but its syntax is also highly flexible, even within declarative sentences [18]. This contrasts sharply with the stricter subject-verb-object structure of the English language, which allows for few inversion exceptions [19].

Unlike causal language models, encoder-only non-causal language models generate word embeddings using bidirectional contexts, which means that the model output can be conditioned by both left and right tokens. This eliminates the output sequence’s sequential dependencies and allows alternative generation orders.

In light of this, we start from the hypothesis that decoder-only (causal) transformer language models may introduce generation bias in target languages with less rigid word ordering than English, subject omission, or different attachment preferences for relative clauses, so that other language generation strategies may be more desirable for these target languages. To put this hypothesis to the test, in addition to English, we consider Spanish, a language with a different grammatical structure and also a broad base of speakers (English and Spanish have over 1.5 billion and 0.5 billion speakers, respectively, a substantial share of the world’s population). However, the approaches in this work can be extended to obtain insights on other languages and NLP tasks.

Our contributions are:

  1. First, we present a novel information-theoretic approach to study language predictability. We compare the causal context-conditioned entropy and the non-causal context-conditioned entropy of the grammatical category distribution of source natural texts to assess whether their language is more predictable from causal or non-causal language contexts. This reveals lower average non-causal conditional entropy in Spanish and lower causal conditional entropy in English. According to this assessment, Spanish is more predictable than English given a non-causal context.

  2. Then, using both automatic (based on conditional relative entropy) and manual evaluation methodologies, we put decoder-only and encoder-only transformer language models to the test to assess empirical causal and non-causal NLG performance. We seek to evaluate whether the currently dominant causal NLG paradigm is adequate from a language-agnostic perspective or whether specific languages may benefit from other word generation orderings. Our results indicate that the best performance is achieved with causal NLG in English and non-causal NLG in Spanish. These insights support further research in NLG in Spanish using bidirectional transformer language models instead of the dominant decoder-only ones.

The rest of this paper is organized as follows. Section II reviews related work on both psycholinguistic language predictability and language model causality in NLG. Section III describes the proposed analytical methodology used for the experiments. Sections IV and V present the details and results of the assessments of predictability and text generation performance, respectively. Section VI summarizes and discusses the results obtained. Finally, Section VII concludes the paper.

II Related work

In this section, we discuss relevant works on both causality in NLG (Section II-A) and language predictability (Section II-B).

II-A Causality in generative transformer language models

The contextual awareness of a transformer is controlled by self-attention. The base concept behind this attention mechanism is a mapping of a query ($\mathbf{q}$) into pairs of keys ($\mathbf{k}$) and values ($\mathbf{v}$). By respectively denoting the queries’, keys’, and value sets’ matrices as $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$, we define self-attention as:

\begin{equation}
\mathrm{Attention}\left(\mathbf{Q},\mathbf{K},\mathbf{V}\right)=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{\left|\mathbf{k}\right|}}\right)\mathbf{V}
\tag{1}
\end{equation}

Transformers, rather than a single attention function, project queries, keys, and values onto $h$ separate heads. This is called multi-head attention:

\begin{equation}
\mathrm{MultiHead}\left(\mathbf{Q},\mathbf{K},\mathbf{V}\right)=\mathrm{concat}\left(\mathrm{head}_{1},\cdots,\mathrm{head}_{h}\right)\mathbf{W}^{O}
\tag{2}
\end{equation}

By denoting each head attention function as:

\begin{equation}
\mathrm{head}_{i}=\mathrm{Attention}\left(\mathbf{Q}\mathbf{W}_{i}^{Q},\mathbf{K}\mathbf{W}_{i}^{K},\mathbf{V}\mathbf{W}_{i}^{V}\right)
\tag{3}
\end{equation}

where $\mathbf{W}_{i}^{Q}$, $\mathbf{W}_{i}^{K}$, $\mathbf{W}_{i}^{V}$ and $\mathbf{W}^{O}$ are parameter projection matrices for the queries, keys, values, and output respectively.

This attention mechanism is present in all the layers of both the encoder and the decoder, if present. While the encoder’s attention is bidirectional, the decoder has two different types of attention: (i) a masked multi-head attention block that masks non-causal context and (ii) a bidirectional multi-head attention block that receives non-causal information from the encoder.
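The difference between the bidirectional and the masked attention blocks can be illustrated with a minimal sketch of Eq. (1). The function names, the list-based matrix representation, and the toy inputs are assumptions made for this example; it is not the implementation of any particular transformer library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention (Eq. 1) on lists of row vectors.

    With causal=True, position i only attends to positions j <= i,
    mimicking the masked attention block of the decoder.
    Returns the attended values and the attention weights.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask the non-causal (right) context before the softmax.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy example: three positions with two-dimensional vectors (illustrative values).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, bidirectional_weights = attention(Q, K, V)
_, masked_weights = attention(Q, K, V, causal=True)
```

In the masked case, the first position can only attend to itself, whereas in the bidirectional case it also assigns weight to the tokens on its right.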

Even though this encoder-decoder architecture is popular in some NLP tasks such as machine translation [20, 21, 22, 23], several transformer-based models only have one of these components. By omitting the encoder in decoder-only transformers, all non-causal contextual dependencies are removed by exclusively using masked attention. Decoder-only transformers are nowadays the best performing task-agnostic NLG systems. Nevertheless, there exist some state-of-the-art non-causal NLG solutions. For example, non-causal language models can be trained for the Masked Language Modeling (MLM) objective, a task in which the language model predicts masked words within a sentence [24]. Typically, non-causal NLG systems are focused on particular tasks such as speech recognition [25, 26, 27], style transfer and grammar correction [28], textual data augmentation [29], and task-specific dialog systems [30, 31].

II-B Language predictability

Conditional entropy is a typical metric for evaluating the predictability of a problem given its input variables and expected output probability distribution [32, 33]. Conditional entropy $H(X\mid Y)$ measures the extra information carried by a variable $X$ when another conditional variable $Y$ is available as side information.

In psycholinguistics, surprisal theory also uses this information-theoretic concept to quantify processing difficulty in sentence comprehension [34, 35]. Multiple studies provide empirical evidence for this expectation-based theory by showing correlations between textual surprisal and both test subjects’ reading times, as in [36], and brain activity, as in [37].

Even if generally accepted, surprisal theory does not model working memory in text comprehension, disregarding processing difficulties in integrating words or components that are widely apart within a text [38, 39, 40, 41, 42]. Lossy context surprisal [43] combines expectation and memory-based predictability theories by modeling working memory constraints as noise. Even though this model premise is independent of language, it can accurately reflect several language-specific text-processing phenomena.

Lossy context surprisal recreates structural forgetting by dropping part of the context and re-sampling it incorrectly from the a priori language knowledge probability model. Structural forgetting [44] is a common grammatical illusion in English in which ungrammatical double-embedded relative clauses can be perceived as correct. Probabilistic language expectations can by themselves determine this. In [45], it is shown that native and non-native speakers exhibit structural forgetting in English, but do not behave this way when presented with the same syntactic structures in German or Dutch. This propensity of the English probabilistic distribution to such backward prediction mistakes is consistent with the issue of non-causal English text generation.

The main goals of neural NLP and psycholinguistics approaches to language cognition are very similar: (i) to give formally explicit descriptions of the mental structures underpinning cognitive processes, and (ii) to explain the learning mechanisms behind them [46]. Even if research in these areas tends to diverge, recent contributions to the study of linguistic theory use language models [47], further evidencing their alignment.

However, to our knowledge, psycholinguistics concepts have yet to be applied to neural language modeling other than for data set elaboration [48]. With this in mind, our work provides a novel linguistic-based conditional entropy hypothesis test for language modeling causality (see contribution 1 above), whose findings can support future NLG designs and methodologies.

III Methodology

III-A Predictability hypothesis test

Causal language models predict the next token in a sequence of tokens. These models are solely concerned with the left context for sinistrodextral (i.e., written from left to right) languages such as Spanish and English (conversely, non-causal models trained on the MLM task consider the bidirectional context for blank-filling-based text generation).

Given a sequence of tokens $X$ as context, language models provide the probability mass function for the next predicted token $\hat{X}$. For a generation index $i<N$, we define the $N$-long input causal context as follows:

\begin{equation}
\mathbf{x}_{c_{i}}=\begin{bmatrix}x_{i-N}&\dots&x_{i-1}\end{bmatrix}^{T}
\tag{4}
\end{equation}

And the non-causal context as:

\begin{equation}
\mathbf{x}_{n_{i}}=\begin{bmatrix}x_{0}&\dots&x_{i-1}&x_{i+1}&\dots&x_{n}\end{bmatrix}^{T}
\tag{5}
\end{equation}

So that we can express the output of a causal language model as:

\begin{equation}
\mathbf{y}_{c_{i}}=\begin{bmatrix}p(\hat{X}_{i}=v_{0}\mid X_{c}=\mathbf{x}_{c_{i}})\\ \vdots\\ p(\hat{X}_{i}=v_{\left|\mathcal{V}\right|-1}\mid X_{c}=\mathbf{x}_{c_{i}})\end{bmatrix}
\tag{6}
\end{equation}

And the output of a non-causal language model as follows:

\begin{equation}
\mathbf{y}_{n_{i}}=\begin{bmatrix}p(\hat{X}_{i}=v_{0}\mid X_{n}=\mathbf{x}_{n_{i}})\\ \vdots\\ p(\hat{X}_{i}=v_{\left|\mathcal{V}\right|-1}\mid X_{n}=\mathbf{x}_{n_{i}})\end{bmatrix}
\tag{7}
\end{equation}

with vocabulary set $\mathcal{V}=\left\{\begin{matrix}v_{0}&\dots&v_{\left|\mathcal{V}\right|-1}\end{matrix}\right\}$ of size $\left|\mathcal{V}\right|$.

As stated in Section II-B, we use a novel metric of conditional entropy to test whether a language is more or less predictable given causal or non-causal contexts (and thus, for example, whether Spanish NLG may benefit from non-causal language generation ordering). The less conditional entropy a problem has, the more predictable its outcome. As NLG is a language prediction task in which previously generated words are available as context, we want to compare the conditional entropy in two scenarios: (i) causal text generation, in which text is generated from left to right so that we provide words to the left of the predicted one as context; and (ii) non-causal text generation, which uses both left and right context for word prediction.

In order to test the predictability of causal and non-causal language models for English and Spanish, we compute and compare the average causal and non-causal conditional entropy for textual data in both languages:

\begin{equation}
\overline{H(\hat{X}\mid X_{c})}=\sum_{\mathbf{x}_{c}\in\mathcal{X}^{N}_{c}}p(X_{c}=\mathbf{x}_{c})H(\hat{X}\mid X_{c}=\mathbf{x}_{c})
\tag{8}
\end{equation}

\begin{equation}
\overline{H(\hat{X}\mid X_{n})}=\sum_{\mathbf{x}_{n}\in\mathcal{X}^{N}_{n}}p(X_{n}=\mathbf{x}_{n})H(\hat{X}\mid X_{n}=\mathbf{x}_{n})
\tag{9}
\end{equation}

with:

\begin{equation}
H(\hat{X}\mid X=\mathbf{x})=\sum_{\hat{x}\in\mathcal{V}}p(\hat{x}\mid X=\mathbf{x})\log\frac{1}{p(\hat{x}\mid X=\mathbf{x})}
\tag{10}
\end{equation}

It must be noted that both $\mathcal{X}^{N}_{c}$ and $\mathcal{X}^{N}_{n}$ have size $\left|\mathcal{V}\right|^{N}$. Context length and vocabulary size determine the accuracy of our estimation results. As we have no previous information about token probability distribution, we model both $p(X)$ and $p(\hat{X}=\hat{x}\mid X=\mathbf{x})$ as categorical distributions. The estimators used for these distributions are:

\begin{equation}
\widehat{Pr}(X=\mathbf{x}_{i})=\frac{L_{i}}{L}
\tag{11}
\end{equation}

and

\begin{equation}
\widehat{Pr}(\hat{X}=x_{j}\mid X=\mathbf{x}_{i})=\frac{L_{i_{j}}}{L_{i}}
\tag{12}
\end{equation}

with $L$ being the token sequence length, $L_{i}$ the number of instances of the context $i$, and $L_{i_{j}}$ the number of instances in which token $j$ appears given context $i$.

In case both $p(X)$ and $p(\hat{X}=\hat{x}\mid X=\mathbf{x})$ are discrete uniform distributions, these estimators have normalized variances $\frac{\left|\mathcal{V}\right|^{N}-1}{L}$ and $\frac{\left|\mathcal{V}\right|^{N}\left(\left|\mathcal{V}\right|-1\right)}{L}$, respectively. This means that, in order to set our estimators’ normalized variances to a specific value, the number of analyzed tokens should be proportional to $\left|\mathcal{V}\right|^{N}$ and $\left|\mathcal{V}\right|^{N+1}$, respectively.
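To give a sense of scale, the following back-of-the-envelope calculation applies the two normalized variances above under purely illustrative assumptions: a context length $N=2$, a target normalized variance of $0.01$, a grammatical tag set with $\left|\mathcal{V}\right|=17$ categories, and a subword vocabulary with $\left|\mathcal{V}\right|=50{,}000$ entries. None of these values is taken from the experiments.

\begin{align*}
\left|\mathcal{V}\right|=17:&\quad L\geq\frac{17^{2}-1}{0.01}=28{,}800, &\quad L\geq\frac{17^{2}\cdot 16}{0.01}=462{,}400\\
\left|\mathcal{V}\right|=50{,}000:&\quad L\geq\frac{50{,}000^{2}-1}{0.01}\approx 2.5\times 10^{11}, &\quad L\geq\frac{50{,}000^{2}\cdot 49{,}999}{0.01}\approx 1.25\times 10^{16}
\end{align*}

Under these assumptions, moving from a tag set to a subword vocabulary raises the required number of tokens by roughly seven and ten orders of magnitude, respectively.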

Therefore, given the data available, neither word nor subword tokenization is feasible. We instead use a grammatical categorization based on Part-Of-Speech (POS) tagging. It reduces vocabulary size and data requirements dramatically while maintaining the original goal.
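The estimators in Eqs. (8)-(12) can be sketched in a few lines over a sequence of POS tags. This is an illustrative sketch and not necessarily the code used in the experiments: the function name is hypothetical, the logarithm is taken in base 2, the non-causal context is assumed to be split evenly between left and right neighbors, and $L$ is taken as the number of positions that have a complete context.

```python
import math
from collections import Counter, defaultdict


def average_conditional_entropy(tags, n, causal=True):
    """Plug-in estimate of the average conditional entropy (Eqs. 8-12), in bits.

    causal=True  -> the context is the n tags to the left of the target.
    causal=False -> the context is n // 2 tags to the left and the remaining
                    tags to the right (assumed split, for illustration only).
    """
    left = n if causal else n // 2
    right = 0 if causal else n - left

    context_counts = Counter()           # L_i: occurrences of each context
    joint_counts = defaultdict(Counter)  # L_{i_j}: target tag counts per context

    for i in range(left, len(tags) - right):
        context = tuple(tags[i - left:i]) + tuple(tags[i + 1:i + 1 + right])
        context_counts[context] += 1
        joint_counts[context][tags[i]] += 1

    total = sum(context_counts.values())  # L: number of analyzed positions
    entropy = 0.0
    for context, l_i in context_counts.items():
        h = 0.0
        for l_ij in joint_counts[context].values():
            p = l_ij / l_i                # Eq. (12)
            h += p * math.log2(1.0 / p)   # Eq. (10)
        entropy += (l_i / total) * h      # Eqs. (8)-(9) weighted by Eq. (11)
    return entropy
```

Comparing the value returned with `causal=True` against the one returned with `causal=False`, for the same tagged corpus and context length, gives the kind of per-language comparison described in this section.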

The resulting hypothesis test evaluates how predictable natural English and Spanish syntaxes are for causal and non-causal language models. Our first intuition is that non-causal predictability, as determined by the inverse of the non-causal context-conditioned entropy, will be higher for Spanish than for English and the opposite for causal predictability. By validating this, we can demonstrate that causal ordering may not be ideal for Spanish NLG, paving the way for further study of non-causal Spanish text generation approaches based on bidirectional transformers.

III-B Non-causal text generation

For non-causal NLG, first we start with a sequence of [MASK] tokens of the desired length $K$. At each iteration, we re-sample every token once. We mask and fill tokens in groups of size $N$. In order to fill the masked tokens, we sample the output of a non-causal language model, from which we remove adjacent tokens, short prefixes and suffixes, and unknown tokens to enhance the overall quality of the produced sequence. This process is formally described in Algorithm 1.

Algorithm 1 Non-causal text generation.
  $\mathbf{x}\leftarrow\begin{bmatrix}\mathrm{[MASK]}&\dots&\mathrm{[MASK]}\end{bmatrix}_{K}$
  $i\leftarrow 0$
  while $i<I$ do
     index $\leftarrow\texttt{Shuffle}\left(\begin{bmatrix}0&\cdots&K-1\end{bmatrix}\right)$
     $j\leftarrow 0$
     while $j<K-N$ do
        masked $\leftarrow$ index$\left[\begin{bmatrix}j&\cdots&\min\left(j+N,K-1\right)\end{bmatrix}\right]$
        $\mathbf{x}[$masked$]\leftarrow$ [MASK]
        $\mathbf{y}\leftarrow\texttt{NonCausalLM}\left(\mathbf{x}\right)$
        for all $m\in$ masked do
           prob, idx $\leftarrow\texttt{Filter}\left(\mathbf{y}\left[m\right]\right)$
           $\mathbf{x}\left[m\right]\leftarrow\texttt{Sample}$(prob, idx)
        end for
        $j\leftarrow j+N$
     end while
     $i\leftarrow i+1$
  end while

The number of iterations $I$ and the number of tokens masked in each generation step $N$ must be predetermined. These parameters influence the performance and computational efficiency of the algorithm. More masked tokens per generation step mean fewer calls to the language model function ($\lceil\frac{K}{N}\rceil$ calls per iteration), resulting in improved computing efficiency. In this work we set $N=2$ and $I=30$.
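Algorithm 1 can be sketched in Python as follows. The model below is a hypothetical stand-in that returns a uniform distribution over a toy vocabulary, and the filter only removes a list of banned tokens, which simplifies the filtering described above; both would have to be replaced by a real non-causal language model and its filtering rules. The inner loop is written so that every position is re-sampled once per iteration, in line with the $\lceil K/N \rceil$ model calls per iteration stated above.

```python
import random

MASK = "[MASK]"


def toy_noncausal_lm(x, vocab):
    """Hypothetical stand-in for a masked LM: uniform distribution per position."""
    return [{v: 1.0 / len(vocab) for v in vocab} for _ in x]


def filter_candidates(dist, banned):
    """Drop banned tokens (e.g. unknown tokens) and renormalize the rest."""
    kept = {tok: p for tok, p in dist.items() if tok not in banned}
    total = sum(kept.values())
    tokens = list(kept)
    return [kept[tok] / total for tok in tokens], tokens


def generate(lm, vocab, k, n=2, iterations=30, banned=("[UNK]",), seed=0):
    """Iteratively mask and re-sample groups of n positions in random order."""
    rng = random.Random(seed)
    x = [MASK] * k
    for _ in range(iterations):
        index = list(range(k))
        rng.shuffle(index)
        for j in range(0, k, n):          # ceil(k / n) model calls per iteration
            masked = index[j:j + n]
            for m in masked:
                x[m] = MASK
            y = lm(x, vocab)              # one distribution per position
            for m in masked:
                prob, idx = filter_candidates(y[m], banned)
                x[m] = rng.choices(idx, weights=prob, k=1)[0]
    return x


# Illustrative call with a toy vocabulary; the defaults mirror N = 2 and I = 30.
sequence = generate(toy_noncausal_lm, ["the", "cat", "sleeps", "[UNK]"], k=6)
```

Because the stand-in model ignores its context, the output of this sketch is random text; only the masking and re-sampling schedule is meaningful here.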

III-C Automatic evaluation

The relative entropy $\mathcal{D}_{KL}\left(P \parallel Q\right)$, also known as KL divergence, quantifies the expected increase in uncertainty that comes from modeling a reference distribution $P$ as another distribution $Q$. In this study, we use the following formulation for the conditional relative entropy for both causal ($X = X_c$) and non-causal ($X = X_n$) contexts:

\begin{equation}
\begin{split}
\mathcal{D}_{KL}\left(P(\hat{X}\mid X) \parallel Q(\hat{X}\mid X)\right) = & \sum_{\mathbf{x}\in\mathcal{X}^{N}} q(\mathbf{x}) \sum_{\hat{x}\in\mathcal{V}} p(\hat{x}\mid X=\mathbf{x}) \\
& \log\frac{p(\hat{x}\mid X=\mathbf{x})}{q(\hat{x}\mid X=\mathbf{x})}
\end{split}
\tag{13}
\end{equation}

where $p$ and $q$ are the conditional probability mass functions of the POS tags in our reference textual data set and in the sequences to evaluate, respectively.
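A minimal sketch of how Eq. (13) could be estimated from two POS tag sequences follows. It is not the authors' implementation: the function names, the additive smoothing constant `eps`, the base-2 logarithm, and the particular non-causal window (half of the context on each side of the predicted tag) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def conditional_counts(tags, n, causal=True):
    """Count target tags per context over a POS tag sequence.

    causal=True: the context is the n tags before the target.
    causal=False: illustrative non-causal choice with n // 2 tags on the
    left of the target and the remaining tags on its right.
    """
    left = n if causal else n // 2
    right = n - left
    counts = defaultdict(Counter)
    for i in range(left, len(tags) - right):
        context = tuple(tags[i - left:i]) + tuple(tags[i + 1:i + 1 + right])
        counts[context][tags[i]] += 1
    return counts


def conditional_relative_entropy(ref_tags, gen_tags, vocab, n=2, causal=True, eps=1e-6):
    """Estimate Eq. (13): sum_x q(x) sum_t p(t|x) log2(p(t|x) / q(t|x)).

    p is estimated from the reference corpus and q from the generated text.
    Additive smoothing (eps) is an assumption that keeps the logarithm finite.
    """
    p_counts = conditional_counts(ref_tags, n, causal)
    q_counts = conditional_counts(gen_tags, n, causal)
    q_total = sum(sum(c.values()) for c in q_counts.values())
    divergence = 0.0
    for context, q_ctx in q_counts.items():
        q_x = sum(q_ctx.values()) / q_total  # context weight q(x)
        p_ctx = p_counts.get(context, Counter())
        p_norm = sum(p_ctx.values()) + eps * len(vocab)
        q_norm = sum(q_ctx.values()) + eps * len(vocab)
        for tag in vocab:
            p = (p_ctx[tag] + eps) / p_norm
            q = (q_ctx[tag] + eps) / q_norm
            divergence += q_x * p * math.log2(p / q)
    return divergence


vocab = ["DET", "NOUN", "VERB", "ADJ"]
reference = ["DET", "NOUN", "VERB"] * 20
generated = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB"] * 10
print(conditional_relative_entropy(reference, generated, vocab, n=2, causal=True))
print(conditional_relative_entropy(reference, generated, vocab, n=2, causal=False))
```

Contexts that never occur in the evaluated sequences have $q(\mathbf{x}) = 0$ and do not contribute, which is why the loop only visits the contexts observed in the generated text.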

III-D Manual evaluation

The annotators were asked yes/no questions on the following aspects to assess the quality of the generated sequences:

  • Q1. Concordance, penalizing improper use of verb tenses, number, and, if applicable, gender of determiners, adjectives, nouns, and pronouns.

  • Q2. Syntactic structure correctness, by checking if all sequences have at least one subject and one verb and assessing that the sentences are syntactically sound in general.

  • Q3. Word or phrase-level repetitions, by penalizing word redundancy, duplication in enumerations, and subject redundancy, while trying to respect those repetitions that may be considered stylistic choices.

  • Q4. Word sense. Language models can generate new words by combining prefixes, suffixes, and pronouns as sequential tokens. This question penalizes nonsensical words.

We assessed inter-annotator agreement with accuracy and $\alpha$-reliability [49] to verify that the annotations were neither arbitrary ($acc = \alpha = 0$) nor redundant ($acc = \alpha = 1$).
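The two agreement measures can be sketched as follows. This is an illustration under assumptions: accuracy is taken to be the pooled pairwise agreement between annotators, and $\alpha$-reliability is taken to be Krippendorff's alpha for nominal labels; the text does not spell out either definition, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_accuracy(units):
    """Fraction of annotator pairs giving the same label, pooled over all units."""
    agree = total = 0
    for labels in units:
        for a, b in combinations(labels, 2):
            agree += a == b
            total += 1
    return agree / total


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    Each unit is a sequence with at least two labels. The value is undefined
    (division by zero) when only one label value occurs in the whole data.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        for i, a in enumerate(labels):
            for j, b in enumerate(labels):
                if i != j:
                    coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), value in coincidences.items():
        totals[a] += value
    n = sum(totals.values())
    observed = sum(v for (a, b), v in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    return 1.0 - observed / expected


# Toy data: four sequences, two annotators each (hypothetical answers).
answers = [("yes", "yes"), ("no", "no"), ("yes", "no"), ("yes", "yes")]
print(pairwise_accuracy(answers))           # 0.75
print(krippendorff_alpha_nominal(answers))  # about 0.533
```

Because alpha corrects for the agreement expected by chance, it can be much lower than accuracy when one answer dominates.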

Finally, we also included a more general rating question (Q5) in which we asked annotators to provide a numerical score between 1 and 5 based on their impression of the annotation and their experience.

IV Predictability test results

We used two different experimental setups to perform the hypothesis test in Section III-A: (i) a first setup using two data sets, one in English and another in Spanish, which are exact translations of each other (Section IV-A); and (ii) another setup with larger, relatively similar English and Spanish data sets that cannot be considered exact translations of each other (Section IV-B). Both setups have advantages and disadvantages. It is more desirable to compare parallel content (setup #1), but bigger data volumes allow for analyzing lengthier contexts (setup #2). As mentioned, we preprocessed the data sets with a POS tagger. The POS tagging module used the spaCy es_core_news_sm\footnote{Available at: https://spacy.io/models/es, June 2024.} and en_core_web_sm\footnote{Available at: https://spacy.io/models/en, June 2024.} pipelines for Spanish and English, respectively. In order to balance the categories, we reduced the original seventeen Universal POS tags\footnote{Available at: https://universaldependencies.org/u/pos, June 2024.} to the following nine: adjectives (ADJ), adpositions (ADP), adverbs (ADV), conjunctions (CONJ), determiners (DET), nouns (NOUN), pronouns (PRON), verbs (VERB), and a last category combining unknown words, interjections, blank spaces, punctuation marks, and symbols (OTHER).
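The tag reduction described above could look like the sketch below. The nine target categories and the contents of OTHER come from the text; how the remaining Universal POS tags (AUX, CCONJ, SCONJ, NUM, PART, PROPN) are folded in is not stated, so the entries marked "assumed" are illustrative guesses. The input is a list of coarse-grained tags such as those exposed by spaCy's `token.pos_` attribute.

```python
# Nine reduced categories named in the text.
CATEGORIES = ("ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN", "PRON", "VERB", "OTHER")

REDUCED = {
    "ADJ": "ADJ", "ADP": "ADP", "ADV": "ADV", "DET": "DET",
    "NOUN": "NOUN", "PRON": "PRON", "VERB": "VERB", "CONJ": "CONJ",
    "CCONJ": "CONJ", "SCONJ": "CONJ",  # assumed: both conjunction types merged
    "AUX": "VERB", "PROPN": "NOUN",    # assumed: not specified in the text
    # Stated in the text: unknown words, interjections, spaces, punctuation, symbols.
    "X": "OTHER", "INTJ": "OTHER", "SPACE": "OTHER", "PUNCT": "OTHER", "SYM": "OTHER",
}


def reduce_tags(upos_tags):
    """Map Universal POS tags to the nine categories.

    Tags without an entry (e.g. NUM, PART) fall back to OTHER, which is
    also an assumption.
    """
    return [REDUCED.get(tag, "OTHER") for tag in upos_tags]


print(reduce_tags(["DET", "PROPN", "AUX", "VERB", "PUNCT"]))
# ['DET', 'NOUN', 'VERB', 'VERB', 'OTHER']
```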

We executed the experiments using two Nvidia A100 GPUs with the specifications in Table I.

Table I: Specifications of the GPUs.
\begin{tabular}{ll}
\hline
\multicolumn{2}{c}{Nvidia A100-PCIE-40GB specifications} \\
\hline
CUDA Driver/Runtime Version & 11.5/11.2 \\
CUDA Capability Version & 8.0 \\
Memory & 40 GB (HBM2 bw: 1555 GB/s) \\
Multiprocessors & 108 \\
CUDA Cores & 6912 (64 per MP) \\
GPU Max Clock rate & 1.41 GHz \\
\hline
\end{tabular}

IV-A Tale data sets

Table II: Tale data set sources.
\begin{tabular}{ll}
\hline
Source & Language \\
\hline
Ciudad Seva\footnote{Available at: https://ciudadseva.com, June 2024.} & Spanish \\
Rincón Castellano\footnote{Available at: https://www.rinconcastellano.com, June 2024.} & Spanish \\
Elejandría\footnote{Available at: https://www.elejandria.com, June 2024.} & Spanish \\
Andersenstories.com\footnote{Available at: https://www.andersenstories.com, June 2024.} & Spanish \& English \\
Grimmstories.com\footnote{Available at: https://www.grimmstories.com, June 2024.} & Spanish \& English \\
Americanliterature.com\footnote{Available at: https://americanliterature.com, June 2024.} & English \\
D. L. Ashliman’s compilation\footnote{Available at: https://sites.pitt.edu/~dash/perrault.html, June 2024.} & English \\
Long long time ago\footnote{Available at: https://www.longlongtimeago.com, June 2024.} & English \\
Project Gutenberg\footnote{Available at: https://www.gutenberg.org, June 2024.} & English \\
The H.P. Lovecraft Archive\footnote{Available at: https://www.hplovecraft.com, June 2024.} & English \\
\hline
\end{tabular}

The tale data sets of setup #1 consisted of public domain tales, short novels, and fables with Creative Commons-licensed translations. We collected the English and Spanish texts so that both data sets had the same size (3.7M words of raw text each) and could be considered direct translations of each other. We extracted text from Portable Document Format (PDF) files with the Python pdftotext\footnote{Available at: https://pypi.org/project/pdftotext, June 2024.} library and scraped web pages with Scrapy\footnote{Available at: https://scrapy.org, June 2024.} web spiders. Table II shows all the data sources.

We could not compute conditional entropy values for contexts longer than two words due to data set size constraints. However, because of the similarities in content between English and Spanish data, we could efficiently study short-term grammatical dependencies in both languages. In the experiment, we compared the causal two-word context with a specific case of non-causal context in which the predicted word lies between the two contextual words. We provide a brief qualitative analysis of the contexts that resulted in lower conditional entropy values for the predicted term $\hat{X}_i$ for causal and bidirectional contexts in both languages.

Table III: Two-word context-conditioned entropy in bits per tag, tale data set.
\begin{tabular}{lcc}
\hline
Language & $\overline{H(\hat{X}_i \mid X_{i-2}, X_{i-1})}$ & $\overline{H(\hat{X}_i \mid X_{i-1}, X_{i+1})}$ \\
\hline
Spanish & 2.3331 & 1.8444 \\
English & 2.2193 & 1.9726 \\
\hline
\end{tabular}

Table III shows the results of hypothesis testing on the English and Spanish tale data sets. They are consistent with our initial intuition that Spanish is more suited for non-causal text prediction than English: moving from the causal to the bidirectional context lowers the entropy by 0.49 bits in Spanish (2.3331 to 1.8444) but only by 0.25 bits in English (2.2193 to 1.9726). However, reading Table III row-wise, middle tag prediction had lower entropy than causal text prediction in both languages. This does not necessarily mean that non-causal NLG outperformed its causal counterpart, as this experiment disregarded relevant factors, such as initial text generation steps.
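The comparison in Table III can be illustrated with a short sketch that estimates both entropies from a tag sequence. It assumes the standard conditional entropy, in which each context is weighted by its empirical probability; the metric defined in Section III-A may weight contexts differently. The function name and the toy sequence are hypothetical.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(tags, offsets):
    """Estimate H(X_i | context) in bits, with context = tags[i + o] for o in offsets.

    offsets=(-2, -1) gives the causal two-word context and offsets=(-1, 1)
    the bidirectional one with the predicted tag in the middle.
    """
    counts = defaultdict(Counter)
    lo, hi = min(min(offsets), 0), max(max(offsets), 0)
    for i in range(-lo, len(tags) - hi):
        context = tuple(tags[i + o] for o in offsets)
        counts[context][tags[i]] += 1
    total = sum(sum(c.values()) for c in counts.values())
    entropy = 0.0
    for ctx_counts in counts.values():
        n_ctx = sum(ctx_counts.values())
        h = -sum((c / n_ctx) * math.log2(c / n_ctx) for c in ctx_counts.values())
        entropy += (n_ctx / total) * h  # weight by the context probability
    return entropy


# Toy sequence in which the tag after (NOUN, VERB) is ambiguous when reading
# left to right, but every tag is determined by its two neighbors.
toy = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "VERB", "ADV"] * 50
causal = conditional_entropy(toy, offsets=(-2, -1))
bidirectional = conditional_entropy(toy, offsets=(-1, 1))
print(causal > bidirectional)  # True
```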

Table IV: Low entropy causal contexts, Spanish tale data set.
\begin{tabular}{lcc}
\hline
$\left(X_{i-2}, X_{i-1}\right)$ & $MaxProb\left(\hat{X}_i\right)$ & $H(\hat{X}_i \mid X_{i-2}, X_{i-1})$ \\
\hline
(ADP, DET) & NOUN & 0.8567 \\
(DET, DET) & NOUN & 0.9571 \\
(VERB, DET) & NOUN & 0.9985 \\
(ADV, PRON) & VERB & 1.0101 \\
(PRON, PRON) & VERB & 1.0709 \\
(ADV, DET) & NOUN & 1.0919 \\
(CONJ, DET) & NOUN & 1.1460 \\
(ADJ, DET) & NOUN & 1.2238 \\
(OTHER, DET) & NOUN & 1.3036 \\
(CONJ, PRON) & VERB & 1.3081 \\
(NOUN, PRON) & VERB & 1.4421 \\
(NOUN, DET) & NOUN & 1.4576 \\
(DET, ADJ) & NOUN & 1.5038 \\
\hline
\end{tabular}
Table V: Low entropy bidirectional contexts, Spanish tale data set.
\begin{tabular}{lcc}
\hline
$\left(X_{i-1}, X_{i+1}\right)$ & $MaxProb\left(\hat{X}_i\right)$ & $H(\hat{X}_i \mid X_{i-1}, X_{i+1})$ \\
\hline
(DET, ADP) & NOUN & 0.3316 \\
(DET, OTHER) & NOUN & 0.4059 \\
(DET, CONJ) & NOUN & 0.5770 \\
(ADP, NOUN) & DET & 0.7159 \\
(DET, ADJ) & NOUN & 0.7238 \\
(DET, ADV) & NOUN & 0.8258 \\
(PRON, ADP) & VERB & 0.8354 \\
(DET, PRON) & NOUN & 0.8426 \\
(DET, VERB) & NOUN & 0.9659 \\
(NOUN, CONJ) & OTHER & 1.2998 \\
(PRON, DET) & VERB & 1.3234 \\
(ADJ, CONJ) & OTHER & 1.3343 \\
(PRON, ADV) & VERB & 1.3719 \\
(ADP, OTHER) & NOUN & 1.3813 \\
(VERB, NOUN) & DET & 1.4323 \\
(ADP, CONJ) & NOUN & 1.4736 \\
(CONJ, NOUN) & DET & 1.4812 \\
(PRON, OTHER) & VERB & 1.5257 \\
(ADP, ADP) & NOUN & 1.5785 \\
\hline
\end{tabular}
Table VI: Low entropy causal contexts, English tale data set.
\begin{tabular}{lcc}
\hline
$\left(X_{i-2}, X_{i-1}\right)$ & $MaxProb\left(\hat{X}_i\right)$ & $H(\hat{X}_i \mid X_{i-2}, X_{i-1})$ \\
\hline
(ADV, PRON) & VERB & 1.0253 \\
(NOUN, DET) & NOUN & 1.0716 \\
(ADP, DET) & NOUN & 1.1451 \\
(CONJ, PRON) & VERB & 1.2328 \\
(OTHER, PRON) & VERB & 1.2841 \\
(ADJ, ADJ) & NOUN & 1.2921 \\
(VERB, DET) & NOUN & 1.3121 \\
(DET, ADJ) & NOUN & 1.3140 \\
(CONJ, DET) & NOUN & 1.3340 \\
(OTHER, DET) & NOUN & 1.3978 \\
(ADV, DET) & NOUN & 1.4669 \\
(PRON, DET) & NOUN & 1.4800 \\
(NOUN, NOUN) & NOUN & 1.5256 \\
\hline
\end{tabular}
Table VII: Low entropy bidirectional contexts, English tale data set.
\begin{tabular}{lcc}
\hline
$\left(X_{i-1}, X_{i+1}\right)$ & $MaxProb\left(\hat{X}_i\right)$ & $H(\hat{X}_i \mid X_{i-1}, X_{i+1})$ \\
\hline
(DET, ADP) & NOUN & 0.3183 \\
(DET, VERB) & NOUN & 0.5168 \\
(DET, ADV) & NOUN & 0.5461 \\
(DET, OTHER) & NOUN & 0.5776 \\
(DET, CONJ) & NOUN & 0.6541 \\
(ADJ, ADP) & NOUN & 0.9530 \\
(ADJ, OTHER) & NOUN & 0.9721 \\
(NOUN, CONJ) & OTHER & 1.0294 \\
(DET, PRON) & NOUN & 1.0911 \\
(ADJ, CONJ) & NOUN & 1.3138 \\
(ADP, ADJ) & DET & 1.3873 \\
(CONJ, VERB) & PRON & 1.4157 \\
(PRON, ADV) & VERB & 1.4580 \\
(NOUN, OTHER) & NOUN & 1.5198 \\
\hline
\end{tabular}
Figure 1: Spanish tale data set low-entropy context pattern distribution. (a) Causal. (b) Bidirectional.

Figure 2: English tale data set low-entropy context pattern distribution. (a) Causal. (b) Bidirectional.

Tables IV, V, VI and VII show the left and bidirectional contexts, together with their most probable predicted tags, whose entropy was lower than $\log_2(3) \approx 1.585$ bits, that is, the entropy of a uniform choice among three tags. The number of contexts satisfying this condition for the Spanish data set was noticeably higher for bidirectional contexts (19) than for causal contexts (13). Figure 1 shows that, for Spanish, causal patterns were also less diverse, as all of them relied on either pronoun/verb or determiner/noun grammatical dependencies.

As shown in Tables VI and VII, causal and bidirectional low-entropy contexts in English were more balanced (13 and 14 contexts, respectively). Figure 2 reflects the prevalence of adjective/noun dependencies (especially in the causal case). Conversely, adjectives have a much less rigid position within the sentence in Spanish.

IV-B Wikidumps data sets

The data sets of setup #2 comprised a collection of Wikipedia articles from Wikimedia’s Spanish\footnote{Available at: https://dumps.wikimedia.org/eswiki/20220801, June 2024.} and English\footnote{Available at: https://dumps.wikimedia.org/enwiki/20220801, June 2024.} dumps. We extracted and cleaned textual data from these dumps using WikiExtractor\footnote{Available at: https://github.com/attardi/wikiextractor, June 2024.}. Then we loaded and mapped the resulting JSON files with HuggingFace’s Datasets\footnote{Available at: https://github.com/huggingface/datasets, June 2024.} library. We picked one million random non-empty articles in each language for hypothesis testing.
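The random selection of non-empty articles can be sketched with the standard library alone. This is a stand-in for the Datasets-based pipeline mentioned above, not a reproduction of it. It assumes that the extracted dump stores one JSON object per line with a `text` field; the function name and the seed are hypothetical.

```python
import json
import random


def sample_articles(lines, k, seed=0):
    """Reservoir-sample k non-empty articles from an iterable of JSON lines."""
    rng = random.Random(seed)
    sample, seen = [], 0
    for line in lines:
        if not line.strip():
            continue
        article = json.loads(line)
        if not article.get("text", "").strip():
            continue  # skip empty articles
        seen += 1
        if len(sample) < k:
            sample.append(article)
        else:
            j = rng.randrange(seen)  # keep each article with probability k / seen
            if j < k:
                sample[j] = article
    return sample


# Toy input: every other article is empty.
toy_lines = [
    json.dumps({"id": str(i), "text": "" if i % 2 else f"Article body {i}"})
    for i in range(100)
]
picked = sample_articles(toy_lines, k=10)
print(len(picked))  # 10
```

Reservoir sampling reads the dump once and keeps only `k` articles in memory, which suits files too large to load at once.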

The amount of data available allowed us to compute the conditional entropy accurately for longer contexts than in the previous case, yielding average conditional entropy results for contexts of up to six words. For this experiment, we explored all possible non-causal contexts and assessed the impact of the location of the predicted tag on our predictability results.

Table VIII: Conditional entropy in bits per tag, Wikidump data set.
\begin{tabular}{llccc}
\hline
Context & Language & $\overline{H(\hat{X} \mid X_c)}$ & $\overline{H(\hat{X} \mid X_n)}$ & $\overline{H(\hat{X} \mid X_n \setminus X_c)}$ \\
\hline
$N=2$ & Spanish & 2.4981 & 2.3192 & 1.9609 \\
 & English & 2.3335 & 2.2306 & 2.0255 \\
$N=3$ & Spanish & 2.4258 & 2.1286 & 1.8313 \\
 & English & 2.2903 & 2.1077 & 1.9257 \\
$N=4$ & Spanish & 2.3686 & 1.9819 & 1.7247 \\
 & English & 2.2559 & 2.0158 & 1.8544 \\
$N=5$ & Spanish & 2.3412 & 1.8252 & 1.5679 \\
 & English & 2.2329 & 1.9399 & 1.7942 \\
$N=6$ & Spanish & 2.2817 & 1.6351 & 1.3758 \\
 & English & 2.1846 & 1.8599 & 1.7200 \\
\hline
\end{tabular}
Figure 3: Average conditional entropy by predicted tag location for different context lengths. (a) $N=3$. (b) $N=4$. (c) $N=5$. (d) $N=6$.

Table VIII shows the average causal and non-causal conditional entropy results for the Wikidumps data set. Compared to the tale data set, which has a tighter equivalence between languages, Spanish conditional entropy values exceeded those for English more often. For shorter contexts ($N \leq 3$), even the non-causal conditional entropy was higher in the Spanish data set than in the English data set. For this reason, we also report the average entropy conditioned on the set of non-causal contexts excluding the causal one ($X_n \setminus X_c$). This value is always lower in Spanish than in English, which highlights the lower efficiency of left-to-right word prediction in Spanish.

We can further appreciate this effect in Figure 3, which depicts the average conditional entropy for all possible predicted tag locations within contexts of diverse lengths. As we can see, the more balanced the left and right contexts are, the more predictable the grammatical category in both languages. However, this tendency was considerably more noticeable in Spanish and became more evident as the context was longer. Since the average number of words per sentence tends to be higher in Spanish than in English [50, 51, 52], Spanish non-causal NLG is even more promising.

V Text generation results

For the text generation experiment, we used four different language models: Deep ESP’s Spanish GPT-2\footnote{Available at: https://huggingface.co/DeepESP/gpt2-spanish, June 2024.} and University of Chile’s Spanish BERT\footnote{Available at: https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased, June 2024.} for Spanish causal and non-causal language modeling, respectively, and OpenAI’s GPT-2 small\footnote{Available at: https://huggingface.co/gpt2, June 2024.} and Google’s BERT base\footnote{Available at: https://huggingface.co/bert-base-cased, June 2024.} for English causal and non-causal language modeling, respectively.

We used the small versions of these language models, which range from 110M to 117M parameters. We are aware that these models are not on par with state-of-the-art generative language models, but our goal was not to achieve the best text generation results but rather to compare the performance of causal and non-causal language models with similar characteristics. The chosen models were comparable in training data, number of parameters, and vocabulary size, and thus appropriate for this experiment.

We fine-tuned the models on the tales data set described in Section IV-A using HuggingFace’s Transformers\footnote{Available at: https://github.com/huggingface/transformers, June 2024.} library. We trained for 10 epochs with a batch size of 8 on an Nvidia A100-PCIE-40GB GPU (see Table I).

Then, we generated 1,000 different 50-token sequences for each of the four language models. We assessed the generation performance of each language model using both automatic (Section V-A) and manual (Section V-B) evaluation metrics.

V-A Automatic evaluation

Hereafter, we call “opposite” the language, either Spanish or English, on which a given model was not trained. For automatic evaluation, we used the conditional relative entropy metric described in Section III-C to compare the generated sequences with the tales data sets in the target and opposite languages. This was possible because the sequences were represented with language-independent POS tags.

As expected, Table IX reveals that the sequences generated by all four models adhered more closely to their respective target language’s POS tag distributions. There were no significant differences between causal and non-causal context-conditioned relative entropy results.

English BERT performed significantly worse in terms of adherence to the target language: its conditional relative entropy ($\sim$0.1) is roughly three to five times that of the other three models ($\sim$0.02 to $\sim$0.04), which is consistent with the state of the art [53, 54]. For Spanish, the results of causal and non-causal NLG are more comparable. The Spanish BERT non-causal language model had the lowest conditional relative entropy, outperforming even English GPT-2 when considering adherence to the respective target languages.

Table IX: Conditional relative entropy results. “Causal” and “non-causal” refer to causal and non-causal context-conditioned relative entropy, respectively.
\begin{tabular}{lcccc}
\hline
 & \multicolumn{2}{c}{Spanish Data Set} & \multicolumn{2}{c}{English Data Set} \\
Model & causal & non-causal & causal & non-causal \\
\hline
Spanish GPT-2 & 0.0351 & 0.0387 & 1.1903 & 1.2588 \\
Spanish BERT & 0.0210 & 0.0221 & 1.3512 & 1.4049 \\
English GPT-2 & 1.0096 & 1.5505 & 0.0245 & 0.0292 \\
English BERT & 0.8817 & 1.4437 & 0.1010 & 0.1050 \\
\hline
\end{tabular}

These results show that causal models corresponded more closely to English grammar and non-causal models corresponded more closely to Spanish grammar (by considering grammar as reflected in the English and Spanish data sets). Note that this also held when using the models of the opposite language. That is, the conditional relative entropy of English BERT for Spanish was lower than the conditional relative entropy of English GPT-2 for Spanish, and the conditional relative entropy of Spanish GPT-2 for English was lower than the conditional relative entropy of Spanish BERT for English.

V-B Manual evaluation

We chose 250 sequences at random for each pairing of language (Spanish or English) and model (BERT or GPT-2) to reduce the annotation load while still obtaining useful insights. Each sequence was examined independently using the questions in Section III-D.

Five annotators participated in the evaluation. Table X shows the global inter-annotator agreement analysis of the yes/no replies. According to the thresholds in [55], the $\alpha$-reliability coefficients lay between fair and moderate agreement. Accuracy values were also acceptable for all questions. The grammatical structure (Q2) was the most controversial aspect of the first four questions due to different opinions on linguistic demand, as some annotators were more lenient with one of the languages.

Table X: Global inter-agreement metrics for yes/no questions.
\begin{tabular}{lcccc}
\hline
 & Q1 & Q2 & Q3 & Q4 \\
\hline
Accuracy & 0.906 & 0.818 & 0.829 & 0.966 \\
$\alpha$-reliability & 0.296 & 0.242 & 0.442 & 0.491 \\
\hline
\end{tabular}

Table XI: Manual evaluation results for yes/no questions.
\begin{tabular}{lcccc}
\hline
Model & Q1 (‘yes’ \%) & Q2 (‘yes’ \%) & Q3 (‘no’ \%) & Q4 (‘yes’ \%) \\
\hline
Spanish GPT-2 & 91.6 & 90.8 & 77.6 & 96.4 \\
Spanish BERT & 95.2 & 91.6 & 85.6 & 98.0 \\
English GPT-2 & 99.2 & 93.6 & 89.2 & 98.4 \\
English BERT & 96.4 & 89.2 & 84.8 & 94.8 \\
\hline
\end{tabular}

Next, the manual assessment scores for all four questions in Table XI were consistent with the automatic metrics: BERT was considered to perform better in Spanish and GPT-2 to perform better in English. The best outcome for Spanish language models was in word sense (Q4), whereas English language models scored better in word concordance (Q1). This is consistent with the fact that there is no gender concordance in English. For all language models, the most challenging question was word repetition (Q3).

Table XII: Manual evaluation results, general assessment question.
\begin{tabular}{lcc}
\hline
Model & Average & Norm. Average \\
\hline
Spanish GPT-2 & 3.682 & 0.920 \\
Spanish BERT & 4.124 & 1.034 \\
English GPT-2 & 4.301 & 1.078 \\
English BERT & 3.870 & 0.967 \\
\hline
\end{tabular}

For the more subjective fifth question, as the average rating varied across annotators from 3.529 to 4.31, we normalized the scores of each annotator to an average of 1.0. The results in Table XII indicate that the subjectively perceived quality of Spanish texts generated by BERT is higher than that of texts generated by GPT-2, and vice versa for English, which is consistent with our initial intuition and all the results so far.
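The per-annotator normalization can be sketched as follows, assuming that each score is divided by the mean score of the annotator who gave it, so that every annotator averages 1.0. The data layout, the function name, and the toy ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def normalized_model_scores(ratings):
    """Average per-model score after rescaling each annotator to a mean of 1.0.

    ratings: list of (annotator, model, score) tuples with scores from 1 to 5.
    """
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    annotator_mean = {a: mean(scores) for a, scores in by_annotator.items()}

    by_model = defaultdict(list)
    for annotator, model, score in ratings:
        by_model[model].append(score / annotator_mean[annotator])
    return {model: mean(scores) for model, scores in by_model.items()}


# Toy ratings: annotator "b" is more generous than annotator "a".
toy_ratings = [
    ("a", "model_1", 4), ("a", "model_2", 2),
    ("b", "model_1", 5), ("b", "model_2", 5),
]
print(normalized_model_scores(toy_ratings))
```

A normalized average above 1.0 then means that annotators rated the model above their own typical score, regardless of how strict each annotator was.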

VI Discussion

All the results of the previous sections are aligned with our initial intuition that Spanish is more suited for non-causal language modeling than English. As shown in Table XIII, the predictability tests indicate that natural English text is more predictable than Spanish text given a causal context, by a relatively constant margin of $\sim 5\%$ (between 4.45\% and 7.05\%). However, given a non-causal context, Spanish was more predictable than English, by an increasing margin as the context got longer.
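The margins in Table XIII appear to follow from the entropies in Tables III and VIII as the ratio between the higher and the lower entropy minus one. The sketch below recomputes them under that assumption, which is inferred from the numbers and not stated in the text; the dictionary layout and function name are hypothetical.

```python
# Average conditional entropies (bits per tag) copied from Tables III and VIII.
# (data set, N) -> language -> (causal, non-causal excluding the causal context)
ENTROPY = {
    ("Tales", 2): {"es": (2.3331, 1.8444), "en": (2.2193, 1.9726)},
    ("Wikidumps", 2): {"es": (2.4981, 1.9609), "en": (2.3335, 2.0255)},
    ("Wikidumps", 3): {"es": (2.4258, 1.8313), "en": (2.2903, 1.9257)},
    ("Wikidumps", 4): {"es": (2.3686, 1.7247), "en": (2.2559, 1.8544)},
    ("Wikidumps", 5): {"es": (2.3412, 1.5679), "en": (2.2329, 1.7942)},
    ("Wikidumps", 6): {"es": (2.2817, 1.3758), "en": (2.1846, 1.7200)},
}


def margins(entropy):
    """Return {(data set, N): (English causal margin %, Spanish non-causal margin %)}."""
    result = {}
    for key, h in entropy.items():
        # English has the lower causal entropy, Spanish the lower non-causal one.
        english_causal = 100.0 * (h["es"][0] / h["en"][0] - 1.0)
        spanish_non_causal = 100.0 * (h["en"][1] / h["es"][1] - 1.0)
        result[key] = (english_causal, spanish_non_causal)
    return result


for (name, n), (causal, non_causal) in margins(ENTROPY).items():
    print(f"{name} N={n}: English +{causal:.2f}%, Spanish +{non_causal:.2f}%")
```

The recomputed values agree with Table XIII to within about 0.01 percentage points; the small differences presumably come from the table being computed with unrounded entropies.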

In the automatic evaluation with conditional relative entropy, Spanish BERT showed the highest adherence to its target language grammar.

These results are consistent with the text generation ranking summarized in Table XIV (whose row “automatic evaluation” reflects the conditional relative entropies in Table IX), as English GPT-2 performed better than English BERT, and Spanish BERT better than Spanish GPT-2 in all the evaluation experiments. Manual evaluation consistently ranked English GPT-2 and Spanish BERT as the best language models for NLG. Spanish BERT ranked worse than English BERT in concordance, but, as previously stated, this result might be biased by the lack of gender concordance in English.

Overall, the results of our experiments show that non-causal language modeling is more promising for Spanish NLG than for English.

Table XIII: Predictability test, automatic evaluation results summary
\begin{tabular}{llcc}
\hline
 & & \multicolumn{2}{c}{Highest predictability} \\
Data set & Context length & $\overline{H(\hat{X} \mid X_c)}^{-1}$ & $\overline{H(\hat{X} \mid X_n \setminus X_c)}^{-1}$ \\
\hline
Tales & $N=2$ & English ($+5.13\%$) & Spanish ($+6.95\%$) \\
Wikidumps & $N=2$ & English ($+7.05\%$) & Spanish ($+3.29\%$) \\
 & $N=3$ & English ($+5.92\%$) & Spanish ($+5.15\%$) \\
 & $N=4$ & English ($+4.99\%$) & Spanish ($+7.52\%$) \\
 & $N=5$ & English ($+4.85\%$) & Spanish ($+14.43\%$) \\
 & $N=6$ & English ($+4.45\%$) & Spanish ($+25.02\%$) \\
\hline
\end{tabular}
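Table XIV: Text generation ranking, manual evaluation results summary
\begin{tabular}{lcccc}
\hline
 & \multicolumn{2}{c}{Spanish} & \multicolumn{2}{c}{English} \\
Experiment & GPT-2 & BERT & GPT-2 & BERT \\
\hline
Automatic evaluation & \#3 & \#1 & \#2 & \#4 \\
Q1. Concordance & \#4 & \#3 & \#1 & \#2 \\
Q2. Syntactic structure & \#3 & \#2 & \#1 & \#4 \\
Q3. Repetitions & \#4 & \#2 & \#1 & \#3 \\
Q4. Word sense & \#3 & \#2 & \#1 & \#4 \\
Q5. General rating & \#4 & \#2 & \#1 & \#3 \\
\hline
\end{tabular}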

VII Conclusions

In this paper, we have first assessed English and Spanish predictability given causal and non-causal contexts, demonstrating that Spanish is more predictable given a non-causal context. For this purpose, we developed and computed a novel metric of the average causal and non-causal context-conditioned entropies of the grammatical categories present in similar and strictly parallel English and Spanish textual data sets. The experiments have shown that average causal context-conditioned entropy is higher in Spanish texts than in English texts, and that average non-causal context-conditioned entropy is higher in English texts than in Spanish ones. This was further supported by a study of the grammatical dependencies that are more predictable in each language and how word location within a context influences predictability.

Following the validation of the hypothesis about the relation between causal and non-causal contexts and language predictability, we selected causal and non-causal language generators based on Spanish and English models to analytically assess their quality depending on the target language to generate. To make experiments comparable, we chose similarly dimensioned unidirectional and bidirectional pre-trained transformer language models and fine-tuned them using highly equivalent Spanish and English data sets.

Finally, we evaluated the outcome both analytically and manually to assess the performance of text generation in all test scenarios. In the first case, to compare the compliance of the language models with the grammatical structure of their target languages, we have proposed a conditional relative entropy metric. Manual evaluation, which was validated using inter-agreement metrics, is consistent with the automatic evaluation, which supports the validity of the latter.

The insights of this study motivate further research into language predictability in languages other than English, as well as into efficient text generation using bidirectional transformers in Spanish and other languages with similar grammatical structures.

References

  • [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems.   MIT Press, 2017, pp. 1–11.
  • [2] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A Survey of Transformers,” AI Open, pp. 111–132, 2022.
  • [3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” in Advances in Neural Information Processing Systems.   MIT Press, 2014, pp. 1–9.
  • [4] I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, and L. A. Ureña-López, “A Survey on Bias in Deep NLP,” Applied Sciences, vol. 11, no. 7, pp. 1–3184, 2021.
  • [5] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki, “TyDiQA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 454–470, 2020.
  • [6] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual Denoising Pre-Training for Neural Machine Translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020.
  • [7] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   Association for Computational Linguistics, 2021, pp. 483–498.
  • [8] S. Wu and M. Dredze, “Are All Languages Created Equal in Multilingual BERT?” in Proceedings of the Workshop on Representation Learning for NLP.   Association for Computational Linguistics, 2020, pp. 120–130.
  • [9] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, and I. Gurevych, “How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing.   Association for Computational Linguistics, 2021, pp. 3118–3135.
  • [10] E. Erdem, M. Kuyu, S. Yagcioglu, A. Frank, L. Parcalabescu, B. Plank, A. Babii, O. Turuta, A. Erdem, I. Calixto, E. Lloret, E.-S. Apostol, C.-O. Truică, B. Šandrih, S. Martinčić-Ipšić, G. Berend, A. Gatt, and G. Korvel, “Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning,” Journal of Artificial Intelligence Research, vol. 73, pp. 1131–1207, 2022.
  • [11] M. Artetxe and H. Schwenk, “Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 597–610, 2019.
  • [12] J. H. Clark, D. Garrette, I. Turc, and J. Wieting, “Canine: Pre-Training an Efficient Tokenization-Free Encoder for Language Representation,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 73–91, 2022.
  • [13] R. Guarasci, S. Silvestri, G. D. Pietro, H. Fujita, and M. Esposito, “BERT Syntactic Transfer: A Computational Experiment on Italian, French and English Languages,” Computer Speech & Language, vol. 71, pp. 1–19, 2022.
  • [14] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems.   MIT Press, 2020, pp. 1–25.
  • [15] L. De Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim, and M. Guerini, “GePpeTto Carves Italian into a Language Model,” in Proceedings of the Italian Conference on Computational Linguistics, vol. 2769.   Accademia University Press, 2020, pp. 136–143.
  • [16] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, C. Armentano-Oller, C. Rodriguez-Penagos, A. Gonzalez-Agirre, and M. Villegas, “MarIA: Spanish Language Models,” Procesamiento del Lenguaje Natural, vol. 68, no. 0, pp. 39–60, 2022.
  • [17] F. Davis and M. van Schijndel, “Recurrent Neural Network Language Models Always Learn English-Like Relative Clause Attachment,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics.   Association for Computational Linguistics, 2020, pp. 1979–1990.
  • [18] K. Lahousse and B. Lamiroy, “Word Order in French, Spanish and Italian: A Grammaticalization Account,” Folia Linguistica, vol. 46, pp. 387–415, 2012.
  • [19] A. Assaiqeli, M. Maniam, and M. Farrah, “Inversion and Word Order in English: A Functional Perspective,” Studies in English Language and Education, vol. 8, pp. 523–545, 2021.
  • [20] K. Chen, R. Wang, M. Utiyama, E. Sumita, T. Zhao, M. Yang, and H. Zhao, “Towards More Diverse Input Representation for Neural Machine Translation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1586–1597, 2020.
  • [21] E. Wu, “Learning Accurate Integer Transformer Machine-Translation Models,” SN Computer Science, vol. 2, pp. 1–8, 2020.
  • [22] Y. Kawara, C. Chu, and Y. Arase, “Preordering Encoding on Transformer for Translation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 644–655, 2021.
  • [23] T. Nguyen, L. Nguyen, P. Tran, and H. Nguyen, “Improving Transformer-Based Neural Machine Translation with Prior Alignments,” Complexity, vol. 2021, pp. 1–10, 2021.
  • [24] C. Zeng and S. Li, “Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets,” Wireless Communications and Mobile Computing, vol. 2021, pp. 1–17, 2021.
  • [25] Y. Bai, J. Yi, J. Tao, Z. Tian, Z. Wen, and S. Zhang, “Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1897–1911, 2021.
  • [26] N. Chen, S. Watanabe, J. Villalba, P. Zelasko, and N. Dehak, “Non-Autoregressive Transformer for Speech Recognition,” IEEE Signal Processing Letters, vol. 28, pp. 121–125, 2021.
  • [27] C. Wang, S. Dai, Y. Wang, F. Yang, M. Qiu, K. Chen, W. Zhou, and J. Huang, “ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1207–1218, 2022.
  • [28] M. Kaneko, “Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction,” Journal of Natural Language Processing, vol. 27, pp. 683–687, 2020.
  • [29] D. Park and C. W. Ahn, “Self-Supervised Contextual Data Augmentation for Natural Language Processing,” Symmetry, vol. 11, Art. no. 1393, 2019.
  • [30] V. Balaraman and B. Magnini, “Domain-Aware Dialogue State Tracker for Multi-Domain Dialogue Systems,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 866–873, 2021.
  • [31] S. Yu, Y. Chen, and H. Zaidi, “AVA: A Financial Service Chatbot Based on Deep Bidirectional Transformers,” Frontiers in Applied Mathematics and Statistics, vol. 7, pp. 1–33, 2021.
  • [32] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, “Limits of Predictability in Human Mobility,” Science, vol. 327, pp. 1018–1021, 2010.
  • [33] G. Li, V. L. Knoop, and H. van Lint, “Estimate the Limit of Predictability in Short-Term Traffic Forecasting: An Entropy-Based Approach,” Transportation Research Part C: Emerging Technologies, vol. 138, pp. 1–18, 2022.
  • [34] J. Hale, “A Probabilistic Earley Parser as a Psycholinguistic Model,” in Proceedings of the North American Chapter of the Association for Computational Linguistics on Language technologies.   Association for Computational Linguistics, 2001, pp. 1–8.
  • [35] R. Levy, “Expectation-Based Syntactic Comprehension,” Cognition, vol. 106, pp. 1126–1177, 2008.
  • [36] M. W. Lowder, W. Choi, F. Ferreira, and J. M. Henderson, “Lexical Predictability During Natural Reading: Effects of Surprisal and Entropy Reduction,” Cognitive Science, vol. 42, pp. 1166–1183, 2018.
  • [37] J. M. Henderson, W. Choi, M. W. Lowder, and F. Ferreira, “Language Structure in the Brain: A Fixation-Related fMRI Study of Syntactic Surprisal in Reading,” NeuroImage, vol. 132, pp. 293–300, 2016.
  • [38] E. Gibson, “Linguistic Complexity: Locality of Syntactic Dependencies,” Cognition, vol. 68, pp. 1–76, 1998.
  • [39] R. L. Lewis and S. Vasishth, “An Activation-Based Model of Sentence Processing as Skilled Memory Retrieval,” Cognitive Science, vol. 29, pp. 375–419, 2005.
  • [40] B. Bartek, R. L. Lewis, S. Vasishth, and M. R. Smith, “In Search of On-Line Locality Effects in Sentence Comprehension,” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 37, pp. 1178–1198, 2011.
  • [41] B. Nicenboim, S. Vasishth, C. Gattei, M. Sigman, and R. Kliegl, “Working Memory Differences in Long-Distance Dependency Resolution,” Frontiers in Psychology, vol. 6, Art. no. 312, 2015.
  • [42] B. Nicenboim, P. Logačev, C. Gattei, and S. Vasishth, “When High-Capacity Readers Slow Down and Low-Capacity Readers Speed Up: Working Memory and Locality Effects,” Frontiers in Psychology, vol. 7, Art. no. 280, 2016.
  • [43] R. Futrell, E. Gibson, and R. P. Levy, “Lossy-Context Surprisal: An Information-Theoretic Model of Memory Effects in Sentence Processing,” Cognitive Science, vol. 44, pp. 1–54, 2020.
  • [44] S. Vasishth, K. Suckow, R. L. Lewis, and S. Kern, “Short-Term Forgetting in Sentence Comprehension: Crosslinguistic Evidence From Verb-Final Structures,” Language and Cognitive Processes, vol. 25, pp. 533–567, 2010.
  • [45] S. L. Frank, T. Trompenaars, and S. Vasishth, “Cross-Linguistic Differences in Processing Double-Embedded Relative Clauses: Working-Memory Constraints or Language Statistics?” Cognitive Science, vol. 40, pp. 554–578, 2016.
  • [46] J. Pater, “Generative Linguistics and Neural Networks at 60: Foundation, Friction, and Fusion,” Language, vol. 95, no. 1, pp. e41–e74, 2019.
  • [47] M. J. Hofmann, S. Remus, C. Biemann, R. Radach, and L. Kuchinke, “Language Models Explain Word Reading Times Better Than Empirical Predictability,” Frontiers in Artificial Intelligence, vol. 4, 2022.
  • [48] T. Linzen, “What Can Linguistics and Deep Learning Contribute to Each Other? Response to Pater,” Language, vol. 95, pp. e99–e108, 2019.
  • [49] K. Krippendorff, Content Analysis: An Introduction to Its Methodology.   Sage Publications, Inc., 2012.
  • [50] C. H. Bjornsson, “Readability of Newspapers in 11 Languages,” Reading Research Quarterly, vol. 18, p. 480, 1983.
  • [51] M. R. Montaño-Harmon, “Discourse Features of Written Mexican Spanish: Current Research in Contrastive Rhetoric and Its Implications,” Hispania, vol. 74, p. 417, 1991.
  • [52] J. M. Simpson, “Topical Structure Analysis of Academic Paragraphs in English and Spanish,” Journal of Second Language Writing, vol. 9, pp. 293–309, 2000.
  • [53] A. Wang and K. Cho, “BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model,” in Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation.   Association for Computational Linguistics, 2019, pp. 30–36.
  • [54] T. Shen, V. Quach, R. Barzilay, and T. Jaakkola, “Blank Language Models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, 2020, pp. 5186–5198.
  • [55] J. R. Landis and G. G. Koch, “The Measurement of Observer Agreement for Categorical Data,” Biometrics, vol. 33, p. 159, Mar. 1977.
Andrea Busto-Castiñeira received the B.S. degree in Telecommunication Technologies Engineering in 2020 and the M.S. degree in Telecommunication Engineering in 2022 from University of Vigo, Spain, where she is currently pursuing the Ph.D. degree in Information and Communication Technologies. Since 2021, she has been working as a researcher with the Information Technologies Group. Her research interests include Natural Language Processing and pre-trained language models.
F. Javier González-Castaño received the B.S. degree from University of Santiago de Compostela, Spain, in 1990, and the Ph.D. degree from University of Vigo, Spain, in 1998. He is currently a professor at University of Vigo, Spain, where he leads the Information Technologies Group. He has authored over 100 papers in international journals in the fields of telecommunications and computer science, and has participated in several relevant national and international projects. He holds three U.S. patents.
Silvia García-Méndez received the Ph.D. degree in Information and Communication Technologies from University of Vigo in 2021. Since 2015, she has been working as a researcher with the Information Technologies Group at University of Vigo. She is currently collaborating with foreign research centers as part of her postdoctoral stage. Her research interests include Natural Language Processing techniques and Machine Learning algorithms.
Francisco de Arriba-Pérez received the B.S. degree in telecommunication technologies engineering in 2013, the M.S. degree in telecommunication engineering in 2014, and the Ph.D. degree in 2019 from University of Vigo, Spain. He is currently a researcher in the Information Technologies Group at the University of Vigo, Spain. His research includes the development of Machine Learning solutions for different domains like finance and health.
\EOD