Date of publication 28 June 2024. Digital Object Identifier 10.1109/ACCESS.2024.3420710

This work was supported in part by Xunta de Galicia grants ED481B-2021-118, ED481B-2022-093, and ED431C 2022/04, Spain; Ministerio de Educación grant FPU21/00798, Spain; Ministerio de Ciencia e Innovación grant TED2021-130824B-C21, Spain; and by the Galician Supercomputing Center (CESGA), Spain, which provided the computing resources required.

Corresponding author: Andrea Busto-Castiñeira (e-mail: [email protected]).

Predictability and Causality in Spanish and English Natural Language Generation

ANDREA BUSTO-CASTIÑEIRA¹, FRANCISCO J. GONZÁLEZ-CASTAÑO¹, SILVIA GARCÍA-MÉNDEZ¹, and FRANCISCO DE ARRIBA-PÉREZ¹

¹Information Technologies Group, atlanTTic, Telecommunication Engineering School, University of Vigo, 36310 Vigo, Spain

Abstract

In recent years, the field of Natural Language Generation (NLG) has been boosted by advances in deep learning technologies. Nonetheless, these new data-intensive methods introduce language-dependent disparities in NLG, as the main training data sets are in English. Also, most neural NLG systems use decoder-only (causal) transformer language models, which work well for English but were not designed with other languages in mind. In this work, we start from the hypothesis that these models may introduce generation bias in target languages with less rigid word ordering, subject omission, or different attachment preferences for relative clauses, so that other language generation strategies may be more desirable for these target languages.

This paper first compares causal and non-causal language modeling for English and Spanish, two languages with different grammatical structures and over 1.5 billion and 0.5 billion speakers, respectively. For this purpose, we define a novel metric of average causal and non-causal context-conditioned entropy of the grammatical category distribution for both languages as an information-theoretic a priori approach.

The evaluation of natural text sources (such as training data) in both languages reveals lower average non-causal conditional entropy in Spanish and lower causal conditional entropy in English. According to this experiment, Spanish is more predictable than English given a non-causal context. Then, by applying a conditional relative entropy metric to text generation experiments, we find that the best performance is achieved with causal NLG in English and with non-causal NLG in Spanish. These insights support further research in NLG in Spanish using bidirectional transformer language models.

Index Terms:

Language predictability, natural language generation, non-causal language modeling, Spanish language, transformer language models.


I Introduction

Thanks to their capacity to acquire universal language representations from vast amounts of unlabeled text data, transformer-based Natural Language Generation (NLG) models [1] have achieved unprecedented success [2].

Transformers build on Sequence-to-Sequence (seq2seq) language models [3], enhancing them with positional encoding, which enables parallel training while still considering word order, and a novel self-attention mechanism that selects the most relevant parts of the input sequence. The unsupervised nature of transformers’ pre-training facilitates handling vast raw text data.

However, the linguistic prevalence of English on the Internet is a primary source of bias [4]. Most evaluation data sets and benchmarks are primarily or entirely written in English, with very few exceptions [5], and therefore the most innovative contributions rarely target other languages. This linguistic data imbalance has a detrimental effect on polyglot language models’ word embeddings [6, 7], as tokenizers assign longer tokens to character sequences in languages with more extensive representation [8, 9, 10], despite some proposed solutions such as those in [11] and [12]. The multilingual NLP paradigm of cross-lingual transfer learning, which tends to view non-English languages as particular use cases, makes language models favor English-like grammars with stricter word ordering and explicit subjects [13].

Nowadays, the majority of generative language models are decoder-only transformers. OpenAI’s GPT-3/3.5/4 [14] and ChatGPT are the most popular, and Big Science’s BLOOM, Meta AI’s OPT, and Google AI’s BARD are also well-known examples. Most of these models are multilingual; however, their performance varies between languages. This led to monolingual implementations of the smaller GPT-2 model in languages other than English, for example in German and French (both available at: https://cedille.ai, June 2024), Italian [15], and Spanish [16].

These generative models are exclusively causal, that is, they produce text from left to right by recursively feeding the model with previously generated sequences. As in the case of Recurrent Neural Networks (RNN), decoder-only transformers are expectation-based word predictors. These systems tend to favor structures in which related elements are close along the sequence, such as relative clause attachments to syntactically lower nominals in ambiguous contexts, which fits nicely into English syntax [17].

However, the mutually beneficial congruence between causal language modeling and English may not apply to other languages. Not only does Spanish prefer a higher nominal attachment in the resolution of ambiguous relative clauses, but its syntax is also highly flexible, even within declarative sentences [18]. This contrasts sharply with the stricter subject-verb-object structure of the English language, which allows for few inversion exceptions [19].

Unlike causal language models, encoder-only non-causal language models generate word embeddings using bidirectional contexts, which means that the model output can be conditioned by both left and right tokens. This eliminates the output sequence’s sequential dependencies and allows alternative generation orders.

In light of this, we start from the hypothesis that decoder-only (causal) transformer language models may introduce generation bias in target languages with less rigid word ordering than English, subject omission, or different attachment preferences for relative clauses, so that other language generation strategies may be more desirable for these target languages. To put this hypothesis to the test, in addition to English, we consider Spanish, a language with a different grammatical structure and also a broad base of speakers (English and Spanish have over 1.5 billion and 0.5 billion speakers, respectively, a substantial share of the world’s population). However, the approaches in this work can be extended to obtain insights on other languages and NLP tasks.

Our contributions are:

1. First, we present a novel information-theoretic approach to study language predictability. We compare the causal context-conditioned entropy and the non-causal context-conditioned entropy of the grammatical category distribution of source natural texts to assess whether their language is more predictable from causal or non-causal language contexts. This reveals lower average non-causal conditional entropy in Spanish and lower causal conditional entropy in English. According to this assessment, Spanish is more predictable than English given a non-causal context.

2. Then, using both automatic (based on conditional relative entropy) and manual evaluation methodologies, we put decoder-only and encoder-only transformer language models to the test to assess empirical causal and non-causal NLG performance, seeking to evaluate whether the currently dominant causal NLG paradigm is adequate from a language-agnostic perspective or whether specific languages may benefit from other word generation orderings. We find that the best performance is achieved with causal NLG in English and non-causal NLG in Spanish. These insights support further research in NLG in Spanish using bidirectional transformer language models instead of the dominant decoder-only ones.

The rest of this paper is organized as follows. Section II reviews related work on both psycholinguistic language predictability and language model causality in NLG. Section III describes the proposed analytical methodology used for the experiments. Sections IV and V present the details and results of the assessments of predictability and text generation performance, respectively. Section VI summarizes and discusses the results obtained. Finally, Section VII concludes the paper.

II Related work

In this section, we discuss relevant works on both causality in NLG (Section II-A) and language predictability (Section II-B).

II-A Causality in generative transformer language models

The contextual awareness of a transformer is controlled by self-attention. The base concept behind this attention mechanism is a mapping of a query ( $\mathbf{q}$ ) into pairs of keys ( $\mathbf{k}$ ) and values ( $\mathbf{v}$ ). By respectively denoting the queries’, keys’, and value sets’ matrices as $\mathbf{Q}$ , $\mathbf{K}$ and $\mathbf{V}$ , we define self-attention as:

\begin{equation}
\mathrm{Attention}\left(\mathbf{Q},\mathbf{K},\mathbf{V}\right)=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{\left|\mathbf{k}\right|}}\right)\mathbf{V}
\tag{1}
\end{equation}

Transformers, rather than a single attention function, project queries, keys, and values onto $h$ separate heads. This is called multi-head attention:

\begin{equation}
\mathrm{MultiHead}\left(\mathbf{Q},\mathbf{K},\mathbf{V}\right)=\mathrm{concat}\left(\mathrm{head}_{1},\cdots,\mathrm{head}_{h}\right)\mathbf{W}^{O}
\tag{2}
\end{equation}

Each head applies the attention function to its own projections of the queries, keys, and values:

\begin{equation}
\mathrm{head}_{i}=\mathrm{Attention}\left(\mathbf{Q}\mathbf{W}_{i}^{Q},\mathbf{K}\mathbf{W}_{i}^{K},\mathbf{V}\mathbf{W}_{i}^{V}\right)
\tag{3}
\end{equation}

where $\mathbf{W}_{i}^{Q}$ , $\mathbf{W}_{i}^{K}$ , $\mathbf{W}_{i}^{V}$ and $\mathbf{W}^{O}$ are parameter projection matrices for the queries, keys, values, and output respectively.

This attention mechanism is present in every layer of the encoder and of the decoder, whenever the architecture includes them. While the encoder’s attention is bidirectional, the decoder has two different types of attention: (i) a masked multi-head attention block that masks non-causal context and (ii) a bidirectional multi-head attention block that receives non-causal information from the encoder.
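The difference between bidirectional and masked attention can be illustrated with a minimal pure-Python sketch of Eq. (1). The function names, the toy input, and the use of negative infinity to mask future positions are illustrative assumptions, not the implementation of any particular model.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention (Eq. 1) on lists of row vectors.

    With causal=True, position i may only attend to positions j <= i,
    which mimics the masked attention of decoder-only transformers.
    """
    d_k = len(K[0])  # key dimension, |k| in Eq. 1
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Hide every position to the right of the current one.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy self-attention input: three tokens with two-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, bidirectional_weights = attention(X, X, X)
_, causal_weights = attention(X, X, X, causal=True)
```

In this toy case, the first token of the causal variant can only attend to itself, while in the bidirectional variant every token receives non-zero weight from all positions.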

Even though this encoder-decoder architecture is popular in some NLP tasks such as machine translation [20, 21, 22, 23], several transformer-based models only have one of these components. By omitting the encoder in decoder-only transformers, all non-causal contextual dependencies are removed by exclusively using masked attention. Decoder-only transformers are nowadays the best performing task-agnostic NLG systems. Nevertheless, there exist some state-of-the-art non-causal NLG solutions. For example, non-causal language models can be trained for the Masked Language Modeling (MLM) objective, a task in which the language model predicts masked words within a sentence [24]. Typically, non-causal NLG systems are focused on particular tasks such as speech recognition [25, 26, 27], style transfer and grammar correction [28], textual data augmentation [29], and task-specific dialog systems [30, 31].

II-B Language predictability

Conditional entropy is a typical metric for evaluating the predictability of a problem given its input variables and expected output probability distribution [32, 33]. Conditional entropy $H(X\mid Y)$ measures the extra information carried by a variable $X$ when another conditional variable $Y$ is available as side information.
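As a small worked illustration of this definition, the following sketch estimates $H(X\mid Y)$ in bits from a list of observed pairs. The function name and the toy data are hypothetical and only meant to show the two extreme cases.

```python
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(X | Y) in bits, estimated from a list of (x, y) observations."""
    joint = Counter(pairs)
    y_counts = Counter(y for _, y in pairs)
    total = len(pairs)
    h = 0.0
    for (x, y), n_xy in joint.items():
        p_xy = n_xy / total              # joint probability p(x, y)
        p_x_given_y = n_xy / y_counts[y]  # conditional probability p(x | y)
        h -= p_xy * math.log2(p_x_given_y)
    return h

# X fully determined by Y: no uncertainty remains once Y is known.
deterministic = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]
# X independent of Y and uniform over two values: one full bit remains.
independent = [("a", 0), ("b", 0), ("a", 1), ("b", 1)]
```

The first data set yields zero conditional entropy and the second yields one bit, which matches the intuition that side information only helps when it is related to the predicted variable.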

In psycholinguistics, surprisal theory also uses this information-theoretic concept to quantify processing difficulty in sentence comprehension [34, 35]. Multiple studies provide empirical evidence for this expectation-based theory by showing correlations between textual surprisal and both test subjects’ reading times, as in [36], and brain activity, as in [37].

Even if generally accepted, surprisal theory does not model working memory in text comprehension, disregarding processing difficulties in integrating words or components that are widely apart within a text [38, 39, 40, 41, 42]. Lossy context surprisal [43] combines expectation and memory-based predictability theories by modeling working memory constraints as noise. Even though this model premise is independent of language, it can accurately reflect several language-specific text-processing phenomena.

Lossy context surprisal recreates structural forgetting by dropping part of the context and re-sampling it incorrectly from the a priori language knowledge probability model. Structural forgetting [44] is a common grammatical illusion in English in which ungrammatical double-embedded relative clauses can be perceived as correct. Probabilistic language expectations alone can account for this effect. In [45] it is shown that native and non-native speakers exhibit structural forgetting in English, but not when presented with the same syntactic structures in German or Dutch. This propensity of the English probability distribution toward backward prediction mistakes is consistent with the difficulty of non-causal English text generation.

The main goals of neural NLP and psycholinguistics approaches to language cognition are very similar: (i) to give formally explicit descriptions of the mental structures underpinning cognitive processes, and (ii) to explain the learning mechanisms behind them [46]. Even if research in these areas tends to diverge, recent contributions to the study of linguistic theory use language models [47], further evidencing their alignment.

However, to our knowledge, psycholinguistic concepts have yet to be applied to neural language modeling other than for data set elaboration [48]. With this in mind, our work provides a novel linguistic-based conditional entropy hypothesis test for language modeling causality (see our first contribution above), whose findings can support future NLG designs and methodologies.

III Methodology

III-A Predictability hypothesis test

Causal language models predict the next token in a sequence of tokens. These models are solely concerned with the left context for sinistrodextral (i.e., written from left to right) languages such as Spanish and English (conversely, non-causal models trained on the MLM task consider the bidirectional context for blank-filling-based text generation).

Given a sequence of tokens $X$ as context, language models provide the probability mass function for the next predicted token $\hat{X}$ . For a generation index $i<N$ , we define the $N$ -long input causal context as follows:

\begin{equation}
\mathbf{x}_{c_{i}}=\begin{bmatrix}x_{i-N}&\dots&x_{i-1}\end{bmatrix}^{T}
\tag{4}
\end{equation}

The non-causal context is defined as:

\begin{equation}
\mathbf{x}_{n_{i}}=\begin{bmatrix}x_{0}&\dots&x_{i-1}&x_{i+1}&\dots&x_{n}\end{bmatrix}^{T}
\tag{5}
\end{equation}

We can then express the output of a causal language model as:

\begin{equation}
\mathbf{y}_{c_{i}}=\begin{bmatrix}p(\hat{X}_{i}=v_{0}\mid X_{c}=\mathbf{x}_{c_{i}})\\ \vdots\\ p(\hat{X}_{i}=v_{\left|\mathcal{V}\right|-1}\mid X_{c}=\mathbf{x}_{c_{i}})\end{bmatrix}
\tag{6}
\end{equation}

And the output of a non-causal language model as:

\begin{equation}
\mathbf{y}_{n_{i}}=\begin{bmatrix}p(\hat{X}_{i}=v_{0}\mid X_{n}=\mathbf{x}_{n_{i}})\\ \vdots\\ p(\hat{X}_{i}=v_{\left|\mathcal{V}\right|-1}\mid X_{n}=\mathbf{x}_{n_{i}})\end{bmatrix}
\tag{7}
\end{equation}

with vocabulary set $\mathcal{V}=\left\{v_{0},\dots,v_{\left|\mathcal{V}\right|-1}\right\}$ of size $\left|\mathcal{V}\right|$.
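The two kinds of context in Eqs. (4) and (5) can be sketched as simple slicing operations over a tag sequence. The helper names and the example sequence below are hypothetical and serve only to make the definitions concrete.

```python
def causal_context(tokens, i, n):
    """The n tokens immediately to the left of position i (Eq. 4)."""
    if i < n:
        raise ValueError("position i needs at least n tokens to its left")
    return tuple(tokens[i - n:i])

def non_causal_context(tokens, i):
    """Every token of the sequence except the one at position i (Eq. 5)."""
    return tuple(tokens[:i]) + tuple(tokens[i + 1:])

# Illustrative tag sequence; position 3 (the second DET) is the predicted one.
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
left_only = causal_context(tags, 3, 2)
both_sides = non_causal_context(tags, 3)
```

Here the causal context only exposes the two preceding tags, whereas the non-causal context also exposes the adjective and noun that follow the predicted position.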

As stated in Section II-B, we use a novel metric of conditional entropy to test whether a language is more or less predictable given causal or non-causal contexts (and thus, for example, whether Spanish NLG may benefit from non-causal language generation ordering). The lower the conditional entropy of a problem, the more predictable its outcome. As NLG is a language prediction task in which previously generated words are available as context, we want to compare the conditional entropy in two scenarios: (i) causal text generation, in which text is generated from left to right so that we provide words to the left of the predicted one as context; and (ii) non-causal text generation, which uses both left and right context for word prediction.

In order to test the predictability of causal and non-causal language models for English and Spanish, we compute and compare the average causal and non-causal conditional entropy for textual data in both languages:

\begin{equation}
\,\overline{\!{H(\hat{X}\mid X_{c})}}=\sum_{\mathbf{x}_{c}\in\mathcal{X}^{N}_{c}}p(X_{c}=\mathbf{x}_{c})H(\hat{X}\mid X_{c}=\mathbf{x}_{c})
\tag{8}
\end{equation}

\begin{equation}
\,\overline{\!{H(\hat{X}\mid X_{n})}}=\sum_{\mathbf{x}_{n}\in\mathcal{X}^{N}_{n}}p(X_{n}=\mathbf{x}_{n})H(\hat{X}\mid X_{n}=\mathbf{x}_{n})
\tag{9}
\end{equation}

with:

\begin{equation}
H(\hat{X}\mid X=\mathbf{x})=\sum_{\hat{x}\in\mathcal{V}}p(\hat{x}\mid X=\mathbf{x})\log\frac{1}{p(\hat{x}\mid X=\mathbf{x})}
\tag{10}
\end{equation}

It must be noted that both $\mathcal{X}^{N}_{c}$ and $\mathcal{X}^{N}_{n}$ have size $\left|\mathcal{V}\right|^{N}$ . Context length and vocabulary size determine the accuracy of our estimation results. As we have no previous information about token probability distribution, we model both $p(X)$ and $p(\hat{X}=\hat{x}\mid X=\mathbf{x})$ as categorical distributions. The estimators used for these distributions are:

\begin{equation}
\widehat{Pr}(X=\mathbf{x}_{i})=\frac{L_{i}}{L}
\tag{11}
\end{equation}

and

\begin{equation}
\widehat{Pr}(\hat{X}=x_{j}\mid X=\mathbf{x}_{i})=\frac{L_{i_{j}}}{L_{i}}
\tag{12}
\end{equation}

with $L$ being the token sequence length, $L_{i}$ the number of instances of the context $i$ , and $L_{i_{j}}$ the number of instances in which token $j$ appears given context $i$ .
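The estimators in Eqs. (11) and (12) and the averages in Eqs. (8) to (10) can be combined in a short counting routine. The following is a minimal sketch under stated assumptions: the function name and the `target_offset` parameter are hypothetical, the logarithm is taken in base 2, and the number of sliding windows plays the role of $L$.

```python
import math
from collections import Counter, defaultdict

def average_conditional_entropy(tags, n, target_offset):
    """Average context-conditioned entropy in bits per tag (Eqs. 8-12).

    A window of n + 1 consecutive tags slides over the sequence. The tag at
    `target_offset` inside the window is the predicted one and the remaining
    n tags form the context. target_offset == n gives the causal case; any
    smaller offset gives a context that includes tags to the right.
    """
    context_counts = Counter()          # L_i: occurrences of each context
    pair_counts = defaultdict(Counter)  # L_{i_j}: predicted tag j given context i
    for start in range(len(tags) - n):
        window = tags[start:start + n + 1]
        context = tuple(window[:target_offset] + window[target_offset + 1:])
        context_counts[context] += 1
        pair_counts[context][window[target_offset]] += 1

    total = sum(context_counts.values())  # L, taken here as the number of windows
    average = 0.0
    for context, l_i in context_counts.items():
        h = sum((l_ij / l_i) * math.log2(l_i / l_ij)
                for l_ij in pair_counts[context].values())
        average += (l_i / total) * h
    return average

# After "A" the next symbol is "B" or "C" with equal probability.
toy = list("ABACABAC")
causal_bits = average_conditional_entropy(toy, n=1, target_offset=1)
```

With `n=2`, calling the function with `target_offset=2` and `target_offset=1` would correspond to the causal two-tag context and to the context in which the predicted tag lies between the two contextual tags, respectively.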

In case both $p(X)$ and $p(\hat{X}=\hat{x}\mid X=\mathbf{x})$ are discrete uniform distributions, these estimators have normalized variances $\frac{\left|\mathcal{V}\right|^{N}-1}{L}$ and $\frac{\left|\mathcal{V}\right|^{N}\left(\left|\mathcal{V}\right|-1\right)}{L}$ , respectively. This means that, in order to set our estimators’ normalized variances to a specific value, the number of analyzed tokens should be proportional to $\left|\mathcal{V}\right|^{N}$ and $\left|\mathcal{V}\right|^{N+1}$ , respectively.

Therefore, given the data available, neither word nor subword tokenization is feasible. We instead use a grammatical categorization based on Part-Of-Speech (POS) tagging, which reduces vocabulary size and data requirements dramatically while maintaining the original goal.
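The scale of this reduction can be checked with simple arithmetic on the number of distinct contexts, $\left|\mathcal{V}\right|^{N}$, and of context-token pairs, $\left|\mathcal{V}\right|^{N+1}$. In the sketch below, the nine-tag vocabulary corresponds to the reduced POS tag set used in this work, while the subword vocabulary size of 50257 (that of the original GPT-2 tokenizer) is an assumption introduced only to give a sense of scale.

```python
def context_space(vocab_size, n):
    """Number of distinct contexts (|V|^N) and context-token pairs (|V|^(N+1))."""
    return vocab_size ** n, vocab_size ** (n + 1)

POS_TAGS = 9           # reduced POS tag set
SUBWORD_VOCAB = 50257  # GPT-2 vocabulary size, used here only for comparison

pos_contexts, pos_pairs = context_space(POS_TAGS, 2)
sub_contexts, sub_pairs = context_space(SUBWORD_VOCAB, 2)
```

Even for two-token contexts, the subword setting has billions of possible contexts, whereas the POS setting has 81, so the required number of analyzed tokens differs by many orders of magnitude.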

The resulting hypothesis test evaluates how predictable natural English and Spanish syntaxes are for causal and non-causal language models. Our first intuition is that non-causal predictability, as determined by the inverse of the non-causal context-conditioned entropy, will be higher for Spanish than for English and the opposite for causal predictability. By validating this, we can demonstrate that causal ordering may not be ideal for Spanish NLG, paving the way for further study of non-causal Spanish text generation approaches based on bidirectional transformers.

III-B Non-causal text generation

For non-causal NLG, we start with a sequence of [MASK] tokens of the desired length $K$. At each iteration, we re-sample every token once, masking and filling tokens in groups of size $N$. To fill the masked tokens, we sample from the output of a non-causal language model, from which we first remove adjacent tokens, short prefixes and suffixes, and unknown tokens to enhance the overall quality of the produced sequence. This process is formally described in Algorithm 1.

Algorithm 1 Non-causal text generation.

$\mathbf{x}\leftarrow\begin{bmatrix}\mathrm{[MASK]}&\dots&\mathrm{[MASK]}\end{bmatrix}_{K}$
$i\leftarrow 0$
while $i<I$ do
    index $\leftarrow\texttt{Shuffle}\left(\begin{bmatrix}0&\cdots&K-1\end{bmatrix}\right)$
    $j\leftarrow 0$
    while $j<K-N$ do
        masked $\leftarrow$ index$\left[\begin{bmatrix}j&\cdots&\min\left(j+N,K-1\right)\end{bmatrix}\right]$
        $\mathbf{x}\left[\mathrm{masked}\right]\leftarrow\mathrm{[MASK]}$
        $\mathbf{y}\leftarrow\texttt{NonCausalLM}\left(\mathbf{x}\right)$
        for all $m\in$ masked do
            prob, idx $\leftarrow\texttt{Filter}\left(\mathbf{y}\left[m\right]\right)$
            $\mathbf{x}\left[m\right]\leftarrow\texttt{Sample}$(prob, idx)
        end for
        $j\leftarrow j+N$
    end while
    $i\leftarrow i+1$
end while

The number of iterations $I$ and the number of tokens masked in each generation step $N$ must be predetermined. These parameters influence the performance and computational efficiency of the algorithm. More masked tokens per generation step mean fewer calls to the language model function ( $\lceil\frac{K}{N}\rceil$ calls per iteration), resulting in improved computing efficiency. In this work we set $N=2$ and $I=30$ .
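The iterative mask-and-fill procedure can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the experiments: `toy_non_causal_lm` is a hypothetical stand-in that returns a uniform distribution instead of a real masked language model, `keep_known` only removes unknown tokens (the other filtering rules are omitted), and the inner loop is written so that every position is visited exactly once per iteration.

```python
import random

MASK = "[MASK]"

def toy_non_causal_lm(tokens, vocab):
    """Hypothetical stand-in for a masked language model.

    Returns, for every position, a uniform distribution over `vocab`.
    A real model would condition on the unmasked tokens in `tokens`.
    """
    uniform = {word: 1.0 / len(vocab) for word in vocab}
    return [dict(uniform) for _ in tokens]

def keep_known(distribution, banned=("[UNK]",)):
    """Simplified Filter step: drop banned tokens and renormalise."""
    kept = {w: p for w, p in distribution.items() if w not in banned}
    total = sum(kept.values())
    return [p / total for p in kept.values()], list(kept)

def generate(length, vocab, iterations=30, group=2, rng=None):
    """Re-sample every position once per iteration, `group` positions at a time."""
    rng = rng or random.Random(0)
    x = [MASK] * length
    for _ in range(iterations):
        index = list(range(length))
        rng.shuffle(index)                   # new random visiting order
        for j in range(0, length, group):
            masked = index[j:j + group]      # next group of positions
            for m in masked:
                x[m] = MASK
            y = toy_non_causal_lm(x, vocab)  # one model call per group
            for m in masked:
                prob, idx = keep_known(y[m])
                x[m] = rng.choices(idx, weights=prob, k=1)[0]
    return x

sample = generate(6, ["el", "gato", "duerme", "[UNK]"], iterations=3)
```

The defaults `iterations=30` and `group=2` mirror the values of $I$ and $N$ reported above; with a sequence of length $K$, each iteration makes $\lceil K/N\rceil$ calls to the model function.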

III-C Automatic evaluation

The relative entropy $\mathcal{D}_{KL}\left(P\mid\mid Q\right)$ , also known as KL divergence, quantifies the expected increase in uncertainty that comes from modeling a reference distribution $P$ as another distribution $Q$ . In this study, we use the following formulation for the conditional relative entropy for both causal ( $X=X_{c}$ ) and non-causal ( $X=X_{n}$ ) contexts:

\begin{equation}
\begin{split}
\mathcal{D}_{KL}\left(P(\hat{X}\mid X)\mid\mid Q(\hat{X}\mid X)\right)=&\sum_{\mathbf{x}\in\mathcal{X}^{N}}q(\mathbf{x})\sum_{\hat{x}\in\mathcal{V}}p(\hat{x}\mid X=\mathbf{x})\\
&\log\frac{p(\hat{x}\mid X=\mathbf{x})}{q(\hat{x}\mid X=\mathbf{x})}
\end{split}
\tag{13}
\end{equation}

where $p$ and $q$ are the conditional probability mass functions estimated from the POS tags of our reference textual data set and of the sequences to evaluate, respectively.
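A minimal sketch of how Eq. (13) could be computed for causal contexts is shown below. The additive smoothing constant `eps` is an assumption introduced here to avoid divisions by zero when a tag or context is unseen in one of the two texts; the helper names are hypothetical and the logarithm is taken in base 2.

```python
import math
from collections import Counter, defaultdict

def conditional_counts(tags, n):
    """Count causal n-tag contexts and the tag that follows each of them."""
    contexts, following = Counter(), defaultdict(Counter)
    for i in range(n, len(tags)):
        ctx = tuple(tags[i - n:i])
        contexts[ctx] += 1
        following[ctx][tags[i]] += 1
    return contexts, following

def smoothed(following, ctx, tag, tagset, eps):
    """Additively smoothed estimate of p(tag | ctx); eps is an assumption."""
    counts = following.get(ctx, {})
    return (counts.get(tag, 0) + eps) / (sum(counts.values()) + eps * len(tagset))

def conditional_relative_entropy(reference, generated, n=2, eps=1e-3):
    """Eq. 13 in bits, with contexts weighted by their frequency in `generated`."""
    tagset = sorted(set(reference) | set(generated))
    _, p_follow = conditional_counts(reference, n)
    q_ctx, q_follow = conditional_counts(generated, n)
    total = sum(q_ctx.values())
    divergence = 0.0
    for ctx, count in q_ctx.items():
        inner = 0.0
        for tag in tagset:
            p = smoothed(p_follow, ctx, tag, tagset, eps)
            q = smoothed(q_follow, ctx, tag, tagset, eps)
            inner += p * math.log2(p / q)
        divergence += (count / total) * inner
    return divergence

reference_tags = ["DET", "NOUN", "VERB"] * 5
generated_tags = ["DET", "ADJ", "NOUN", "VERB"] * 4
```

A generated sequence whose tag statistics match the reference yields a value of zero, and the value grows as the conditional tag distributions drift apart. The non-causal variant would follow the same pattern with contexts that include tags to the right of the predicted one.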

III-D Manual evaluation

The annotators were asked yes/no questions on the following aspects to assess the quality of the generated sequences:

• Q1. Concordance, penalizing improper use of verb tenses, number, and, if applicable, gender of determiners, adjectives, nouns, and pronouns.

• Q2. Syntactic structure correctness, by checking if all sequences have at least one subject and one verb and assessing that the sentences are syntactically sound in general.

• Q3. Word or phrase-level repetitions, by penalizing word redundancy, duplication in enumerations, and subject redundancy, while trying to respect those repetitions that may be considered stylistic choices.

• Q4. Word sense. Language models can generate new words by combining prefixes, suffixes, and pronouns as sequential tokens. This question penalizes nonsensical words.

We assessed inter-annotator agreement with accuracy and $\alpha$-reliability [49] to verify that the annotations were neither arbitrary ( $acc=\alpha=0$ ) nor redundant ( $acc=\alpha=1$ ).
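Both agreement measures can be computed directly from two annotators' yes/no answers. The sketch below assumes that $\alpha$-reliability refers to Krippendorff's alpha for nominal labels and covers only the simplest setting (two annotators, no missing answers); the function names are hypothetical.

```python
from collections import Counter

def agreement_accuracy(a, b):
    """Fraction of items on which two annotators give the same answer."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha for two annotators, nominal labels, no missing data."""
    values = Counter(a) + Counter(b)  # how often each label was used overall
    n = sum(values.values())          # total number of ratings (two per item)
    # Each disagreeing item contributes two ordered mismatching pairs.
    observed = 2 * sum(x != y for x, y in zip(a, b))
    expected = sum(values[c] * values[k] for c in values for k in values if c != k)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1.0 - (n - 1) * observed / expected

annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 0]
```

Identical annotations give an accuracy and an alpha of one, while systematic disagreement drives alpha below zero.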

Finally, we also included a more general rating question (Q5) in which we asked annotators to provide a numerical score between 1 and 5 based on their impression of the annotation and their experience.

IV Predictability test results

We used two different experimental setups to perform the hypothesis test in Section III-A: (i) a first setup using two data sets, one in English and another in Spanish, which are exact translations of each other (Section IV-A); and (ii) another setup with larger, relatively similar English and Spanish data sets that cannot be considered exact translations of each other (Section IV-B). Both setups have advantages and disadvantages. It is more desirable to compare parallel content (setup #1), but bigger data volumes allow for analyzing lengthier contexts (setup #2). As mentioned, we preprocessed the data sets with a POS tagger. The POS tagging module used the spaCy es_core_news_sm (available at: https://spacy.io/models/es, June 2024) and en_core_web_sm (available at: https://spacy.io/models/en, June 2024) pipelines for Spanish and English, respectively. In order to balance the categories, we reduced the original seventeen Universal POS tags (available at: https://universaldependencies.org/u/pos, June 2024) to the following nine: adjectives (ADJ), adpositions (ADP), adverbs (ADV), conjunctions (CONJ), determiners (DET), nouns (NOUN), pronouns (PRON), verbs (VERB), and a last category combining unknown words, interjections, blank spaces, punctuation marks, and symbols (OTHER).
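The tag reduction can be sketched as a lookup table. Only the composition of the OTHER group is described in the text; the remaining merges below (for example, proper nouns with nouns, auxiliaries with verbs, and numerals and particles with OTHER) are assumptions made for illustration and may differ from the mapping actually used.

```python
# Hypothetical mapping from Universal POS tags (plus spaCy's SPACE tag)
# to the nine reduced categories. Entries marked "assumed" are not
# specified in the text.
REDUCED_TAG = {
    "ADJ": "ADJ",
    "ADP": "ADP",
    "ADV": "ADV",
    "CCONJ": "CONJ", "SCONJ": "CONJ",  # assumed: both conjunction types merged
    "DET": "DET",
    "NOUN": "NOUN", "PROPN": "NOUN",   # assumed: proper nouns counted as nouns
    "PRON": "PRON",
    "VERB": "VERB", "AUX": "VERB",     # assumed: auxiliaries counted as verbs
    "NUM": "OTHER", "PART": "OTHER",   # assumed
    "X": "OTHER", "INTJ": "OTHER", "SPACE": "OTHER",
    "PUNCT": "OTHER", "SYM": "OTHER",
}

def reduce_tags(upos_tags):
    """Map a sequence of Universal POS tags to the nine reduced categories."""
    return [REDUCED_TAG.get(tag, "OTHER") for tag in upos_tags]

reduced = reduce_tags(["DET", "PROPN", "AUX", "VERB", "PUNCT"])
```

With spaCy, the input tags could be obtained as `[token.pos_ for token in nlp(text)]` after loading a pipeline with `spacy.load("en_core_web_sm")`, provided the pipeline has been installed.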

We executed the experiments using two Nvidia A100 GPUs with the specifications in Table I.

Table I: Specifications of the GPUs.

Nvidia A100-PCIE-40GB specifications
CUDA Driver/Runtime Version	11.5/11.2
CUDA Capability Version	8.0
Memory	40 GB (HBM2 bw: 1555 GB/s)
Multiprocessors	108
CUDA Cores	6912 (64 per MP)
GPU Max Clock Rate	1.41 GHz

IV-A Tale data sets

Table II: Tale data set sources.

Source	Language	URL (accessed June 2024)
Ciudad Seva	Spanish	https://ciudadseva.com
Rincón Castellano	Spanish	https://www.rinconcastellano.com
Elejandría	Spanish	https://www.elejandria.com
Andersenstories.com	Spanish & English	https://www.andersenstories.com
Grimmstories.com	Spanish & English	https://www.grimmstories.com
Americanliterature.com	English	https://americanliterature.com
D. L. Ashliman’s compilation	English	https://sites.pitt.edu/~dash/perrault.html
Long long time ago	English	https://www.longlongtimeago.com
Project Gutenberg	English	https://www.gutenberg.org
The H.P. Lovecraft Archive	English	https://www.hplovecraft.com

The tale data sets of setup #1 consisted of public domain tales, short novels, and fables with Creative Commons-licensed translations. We collected the English and Spanish texts so that both data sets had identical sizes (3.7M words of raw text each) and could be considered direct translations. We extracted text from Portable Document Format (PDF) files with the Python pdftotext library (available at: https://pypi.org/project/pdftotext, June 2024) and scraped web pages using Scrapy web spiders (available at: https://scrapy.org, June 2024). Table II shows all the data sources.

We could not compute conditional entropy values for contexts longer than two words due to data set size constraints. However, because of the similarities in content between English and Spanish data, we could efficiently study short-term grammatical dependencies in both languages. In the experiment, we compared the causal two-word context with a specific case of non-causal context in which the predicted word lies between the two contextual words. We provide a brief qualitative analysis of the contexts that resulted in lower conditional entropy values for the predicted term $\hat{X}_{i}$ for causal and bidirectional contexts in both languages.

Table III: Two-word context-conditioned entropy in bits per tag, tale data set.

Language	$\,\overline{\!{H(\hat{X}_{i}\mid X_{i-2},X_{i-1})}}$	$\,\overline{\!{H(\hat{X}_{i}\mid X_{i-1},X_{i+1})}}$
Spanish	2.3331	1.8444
English	2.2193	1.9726

Table III shows the results of hypothesis testing on English and Spanish tale data sets. They are consistent with our initial intuition that Spanish is better suited for non-causal text prediction than English. However, reading Table III row-wise, middle tag prediction had lower entropy than causal text prediction in both languages. This does not necessarily mean that non-causal NLG outperformed its causal counterpart, as this experiment disregarded relevant factors, such as initial text generation steps.

Table IV: Low entropy causal contexts, Spanish tale data set.

$\left(X_{i-2},X_{i-1}\right)$	$MaxProb\left(\hat{X}_{i}\right)$	$H(\hat{X}_{i}\mid X_{i-2},X_{i-1})$
(ADP, DET)	NOUN	0.8567
(DET, DET)	NOUN	0.9571
(VERB, DET)	NOUN	0.9985
(ADV, PRON)	VERB	1.0101
(PRON, PRON)	VERB	1.0709
(ADV, DET)	NOUN	1.0919
(CONJ, DET)	NOUN	1.1460
(ADJ, DET)	NOUN	1.2238
(OTHER, DET)	NOUN	1.3036
(CONJ, PRON)	VERB	1.3081
(NOUN, PRON)	VERB	1.4421
(NOUN, DET)	NOUN	1.4576
(DET, ADJ)	NOUN	1.5038

Table V: Low entropy bidirectional contexts, Spanish tale data set.

$\left(X_{i-1},X_{i+1}\right)$	$MaxProb\left(\hat{X}_{i}\right)$	$H(\hat{X}_{i}\mid X_{i-1},X_{i+1})$
(DET, ADP)	NOUN	0.3316
(DET, OTHER)	NOUN	0.4059
(DET, CONJ)	NOUN	0.5770
(ADP, NOUN)	DET	0.7159
(DET, ADJ)	NOUN	0.7238
(DET, ADV)	NOUN	0.8258
(PRON, ADP)	VERB	0.8354
(DET, PRON)	NOUN	0.8426
(DET, VERB)	NOUN	0.9659
(NOUN, CONJ)	OTHER	1.2998
(PRON, DET)	VERB	1.3234
(ADJ, CONJ)	OTHER	1.3343
(PRON, ADV)	VERB	1.3719
(ADP, OTHER)	NOUN	1.3813
(VERB, NOUN)	DET	1.4323
(ADP, CONJ)	NOUN	1.4736
(CONJ, NOUN)	DET	1.4812
(PRON, OTHER)	VERB	1.5257
(ADP, ADP)	NOUN	1.5785

Table VI: Low entropy causal contexts, English tale data set.

$\left(X_{i-2},X_{i-1}\right)$	$MaxProb\left(\hat{X}_{i}\right)$	$H(\hat{X}_{i}\mid X_{i-2},X_{i-1})$
(ADV, PRON)	VERB	1.0253
(NOUN, DET)	NOUN	1.0716
(ADP, DET)	NOUN	1.1451
(CONJ, PRON)	VERB	1.2328
(OTHER, PRON)	VERB	1.2841
(ADJ, ADJ)	NOUN	1.2921
(VERB, DET)	NOUN	1.3121
(DET, ADJ)	NOUN	1.3140
(CONJ, DET)	NOUN	1.3340
(OTHER, DET)	NOUN	1.3978
(ADV, DET)	NOUN	1.4669
(PRON, DET)	NOUN	1.4800
(NOUN, NOUN)	NOUN	1.5256

Table VII: Low entropy bidirectional contexts, English tale data set.

$\left(X_{i-1},X_{i+1}\right)$	$MaxProb\left(\hat{X}_{i}\right)$	$H(\hat{X}_{i}\mid X_{i-1},X_{i+1})$
(DET, ADP)	NOUN	0.3183
(DET, VERB)	NOUN	0.5168
(DET, ADV)	NOUN	0.5461
(DET, OTHER)	NOUN	0.5776
(DET, CONJ)	NOUN	0.6541
(ADJ, ADP)	NOUN	0.9530
(ADJ, OTHER)	NOUN	0.9721
(NOUN, CONJ)	OTHER	1.0294
(DET, PRON)	NOUN	1.0911
(ADJ, CONJ)	NOUN	1.3138
(ADP, ADJ)	DET	1.3873
(CONJ, VERB)	PRON	1.4157
(PRON, ADV)	VERB	1.4580
(NOUN, OTHER)	NOUN	1.5198

Tables IV, V, VI and VII show left and bidirectional context-predicted word pairs with entropy lower than $\log_{2}(3)\approx 1.585$ bits. We can note that the number of combinations satisfying this condition for the Spanish data set was much higher for bidirectional contexts than for causal contexts. Figure 1 shows that, for Spanish, causal patterns were also less diverse, as all of them relied on either pronoun/verb or determiner/noun grammatical dependencies.
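A table of this kind can be produced by computing the entropy of the middle tag for each pair of neighbouring tags and keeping the contexts below the threshold. The following sketch covers the bidirectional case only; the function name and the toy sequence are hypothetical, and the threshold of $\log_{2}(3)$ bits corresponds to the entropy of a uniform choice among three tags.

```python
import math
from collections import Counter, defaultdict

def low_entropy_contexts(tags, threshold=math.log2(3)):
    """List (X_{i-1}, X_{i+1}) contexts whose middle-tag entropy is below threshold.

    Returns (context, most probable middle tag, entropy in bits) tuples
    sorted by increasing entropy.
    """
    middle = defaultdict(Counter)
    for i in range(1, len(tags) - 1):
        middle[(tags[i - 1], tags[i + 1])][tags[i]] += 1
    rows = []
    for ctx, counts in middle.items():
        total = sum(counts.values())
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        if h < threshold:
            rows.append((ctx, counts.most_common(1)[0][0], h))
    return sorted(rows, key=lambda row: row[2])

toy_tags = ["DET", "NOUN", "ADP", "DET", "NOUN", "ADP", "DET", "ADJ", "ADP"]
rows = low_entropy_contexts(toy_tags)
```

In this toy sequence, the context (DET, ADP) is filled by a noun twice and by an adjective once, so it is the only context with non-zero entropy and appears last in the sorted output.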

As shown in Tables VI and VII, causal and bidirectional low entropy contexts in English were more balanced. Figure 2 reflects the prevalence of the adjective/noun dependencies (especially for the causal case). Conversely, the position of adjectives within the sentence is much less rigid in Spanish.

IV-B Wikidumps data sets

The data sets of setup #2 comprised a collection of Wikipedia articles from Wikimedia’s Spanish (available at: https://dumps.wikimedia.org/eswiki/20220801, June 2024) and English (available at: https://dumps.wikimedia.org/enwiki/20220801, June 2024) dumps. We extracted and cleaned textual data from these dumps using WikiExtractor (available at: https://github.com/attardi/wikiextractor, June 2024). Then we loaded and mapped the resulting JSON files with HuggingFace’s Datasets library (available at: https://github.com/huggingface/datasets, June 2024). We picked one million random non-empty articles in each language for hypothesis testing.

The amount of data available allowed us to compute the conditional entropy accurately for longer contexts than in the previous case, yielding average conditional entropy results for contexts of up to six words. For this experiment, we explored all possible non-causal contexts and assessed the impact of the location of the predicted tag on our predictability results.

Table VIII: Conditional entropy in bits per tag, Wikidump data set.

Context	Language	$\,\overline{\!{H(\hat{X}\mid X_{c})}}$	$\,\overline{\!{H(\hat{X}\mid X_{n})}}$	$\,\overline{\!{H(\hat{X}\mid X_{n}\setminus X_{c})}}$
$N=2$	Spanish	2.4981	2.3192	1.9609
$N=2$	English	2.3335	2.2306	2.0255
$N=3$	Spanish	2.4258	2.1286	1.8313
$N=3$	English	2.2903	2.1077	1.9257
$N=4$	Spanish	2.3686	1.9819	1.7247
$N=4$	English	2.2559	2.0158	1.8544
$N=5$	Spanish	2.3412	1.8252	1.5679
$N=5$	English	2.2329	1.9399	1.7942
$N=6$	Spanish	2.2817	1.6351	1.3758
$N=6$	English	2.1846	1.8599	1.7200

Table VIII shows the average causal and non-causal conditional entropy results for the Wikidumps data set. Unlike in the tale data set, where the two languages were more tightly equivalent, Spanish conditional entropy values often exceeded those for English. For shorter contexts, even the non-causal conditional entropy was higher in the Spanish data set than in the English data set. For this reason, we also report the average entropy conditioned on the set of non-causal contexts excluding the causal context ( $X_{n}\setminus X_{c}$ ). This value is always lower in Spanish than in English, highlighting the lower efficiency of left-to-right word prediction in Spanish.

We can further appreciate this effect in Figure 3, which depicts the average conditional entropy for all possible predicted tag locations within contexts of diverse lengths. As we can see, the more balanced the left and right contexts are, the more predictable the grammatical category is in both languages. However, this tendency was considerably more noticeable in Spanish and became more evident as the context grew longer. Since the average number of words per sentence tends to be higher in Spanish than in English [50, 51, 52], Spanish non-causal NLG is even more promising.

V Text generation results

For the text generation experiment, we used four different language models: Deep ESP’s Spanish GPT-2\footnote{Available at: https://huggingface.co/DeepESP/gpt2-spanish, June 2024.} and University of Chile’s Spanish BERT\footnote{Available at: https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased, June 2024.} for Spanish causal and non-causal language modeling, respectively, and OpenAI’s GPT-2 small\footnote{Available at: https://huggingface.co/gpt2, June 2024.} and Google’s BERT base\footnote{Available at: https://huggingface.co/bert-base-cased, June 2024.} for English causal and non-causal language modeling, respectively.

We used the small versions of these language models, which range from 110M to 117M parameters. We are aware that these models are not on par with state-of-the-art generative language models, but our goal was not to achieve the best text generation results but rather to compare the performance of causal and non-causal language models with similar characteristics. The chosen models were comparable in training data, number of parameters, and vocabulary size, and thus appropriate for this experiment.

We fine-tuned the models on the tales data set described in Section IV-A using HuggingFace’s Transformers\footnote{Available at: https://github.com/huggingface/transformers, June 2024.} library. The training ran for 10 epochs with batch size 8 on an NVIDIA A100-PCIE-40GB (see Table I).
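The fine-tuning setup can be sketched with the Transformers `Trainer` API as follows. This is a hedged sketch rather than the training script used in this work. Only the model identifiers, the number of epochs, and the batch size come from this section. The helper `build_trainer`, the output directory, the padding-token fallback, and the default masking behaviour of the data collator are assumptions, and `train_dataset` is assumed to be already tokenized.

```python
# Model identifiers taken from the footnotes of this section.
MODELS = {
    ("Spanish", "causal"): "DeepESP/gpt2-spanish",
    ("Spanish", "non-causal"): "dccuchile/bert-base-spanish-wwm-cased",
    ("English", "causal"): "gpt2",
    ("English", "non-causal"): "bert-base-cased",
}

# Hyperparameters stated in the text; all others keep library defaults.
FINE_TUNING = {"num_train_epochs": 10, "per_device_train_batch_size": 8}


def build_trainer(language, mode, train_dataset, output_dir="finetuned-model"):
    """Build a Trainer for one (language, mode) pair.

    `output_dir` is a hypothetical path; `train_dataset` must already
    contain tokenized examples (assumption).
    """
    # Imported lazily so the constants above can be inspected
    # without the transformers package installed.
    from transformers import (
        AutoModelForCausalLM,
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    name = MODELS[(language, mode)]
    causal = mode == "causal"
    tokenizer = AutoTokenizer.from_pretrained(name)
    if tokenizer.pad_token is None:
        # GPT-2 style tokenizers may lack a padding token (assumed fallback).
        tokenizer.pad_token = tokenizer.eos_token
    model_cls = AutoModelForCausalLM if causal else AutoModelForMaskedLM
    model = model_cls.from_pretrained(name)
    # mlm=False gives next-token labels; mlm=True applies random masking.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=not causal)
    args = TrainingArguments(output_dir=output_dir, **FINE_TUNING)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=collator,
    )
```

Calling `build_trainer("Spanish", "non-causal", dataset).train()` would then run the fine-tuning loop for that pair. Using the same helper for all four models keeps the training conditions aligned, which is the property the comparison depends on.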

Then, we generated 1,000 different 50-token sequences for each of the four language models. We assessed the generation performance of each language model using both automatic (Section V-A) and manual (Section V-B) evaluation metrics.

V-A Automatic evaluation

In what follows, we call “opposite” the language, either Spanish or English, on which a given model was not trained. For automatic evaluation, we used the conditional relative entropy metric described in Section III-C to compare the generated sequences against the tales data sets in the target and opposite languages. This was possible because language-independent POS tagging was used for tokenization.

As expected, Table IX reveals that the sequences generated by all four models adhered more closely to their respective target language’s POS tag distributions. There were no significant differences between causal and non-causal context-conditioned relative entropy results.

With conditional relative entropy values several times greater than those of the other three models ($\sim$0.1 vs $\sim$0.02 to $\sim$0.04), English BERT performed significantly worse in terms of adherence to the target language, which is consistent with the state of the art [53, 54]. For Spanish, the results of causal and non-causal NLG are more comparable. The Spanish BERT non-causal language model had the lowest conditional relative entropy, outperforming even English GPT-2 when considering adherence to the respective target languages.

Table IX: Conditional relative entropy results. “Causal” and “non-causal” refer to causal and non-causal context-conditioned relative entropy, respectively.

	Spanish Data Set		English Data Set
Model	causal	non-causal	causal	non-causal
Spanish GPT-2	0.0351	0.0387	1.1903	1.2588
Spanish BERT	0.0210	0.0221	1.3512	1.4049
English GPT-2	1.0096	1.5505	0.0245	0.0292
English BERT	0.8817	1.4437	0.1010	0.1050

These results show that causal models corresponded more closely to English grammar and non-causal models corresponded more closely to Spanish grammar (considering grammar as reflected in the English and Spanish data sets). Note that this also held when using the models of the opposite language. That is, the conditional relative entropy of English BERT for Spanish was lower than that of English GPT-2 for Spanish, and the conditional relative entropy of Spanish GPT-2 for English was lower than that of Spanish BERT for English.
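For readers who wish to experiment with a comparable measure, the following Python sketch computes a context-conditioned relative entropy (Kullback-Leibler divergence) between the POS tag distributions of generated and reference text. It follows the textbook definition and is not necessarily identical to the metric of Section III-C. The direction of the divergence (generated text as the first argument), the additive smoothing parameter `alpha`, the window convention, and all function names are assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def context_counts(tag_sequences, n, pos):
    """Count predicted tags per context in n-tag windows (assumed convention)."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for i in range(len(tags) - n + 1):
            window = tuple(tags[i:i + n])
            counts[window[:pos] + window[pos + 1:]][window[pos]] += 1
    return counts


def conditional_relative_entropy(generated, reference, n, pos, alpha=1.0):
    """D(p || q) = sum_{c,x} p(c, x) * log2(p(x | c) / q(x | c)), in bits.

    p is estimated from `generated`, q from `reference`.
    q is smoothed additively with `alpha` so that tags unseen in the
    reference do not yield an infinite divergence (assumption).
    """
    p = context_counts(generated, n, pos)
    q = context_counts(reference, n, pos)
    tagset = {t for c in list(p.values()) + list(q.values()) for t in c}
    total = sum(sum(c.values()) for c in p.values())
    divergence = 0.0
    for ctx, p_ctx in p.items():
        p_n = sum(p_ctx.values())
        q_ctx = q.get(ctx, Counter())
        q_n = sum(q_ctx.values())
        for tag, count in p_ctx.items():
            p_cond = count / p_n
            q_cond = (q_ctx[tag] + alpha) / (q_n + alpha * len(tagset))
            divergence += (count / total) * log2(p_cond / q_cond)
    return divergence
```

With `pos = n - 1` the measure is conditioned on a left-only context, and other values of `pos` give non-causal variants. Because of the smoothing, comparing a corpus with itself yields a small positive value rather than exactly zero, so only relative comparisons between models are meaningful in this sketch.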

V-B Manual evaluation

We chose 250 sequences at random for each pairing of language (Spanish or English) and model (BERT or GPT-2) to reduce the annotation load while still obtaining useful insights. Each sequence was examined independently using the questions in Section III-D.

Five annotators participated in the evaluation. Table X shows the global inter-agreement analysis of yes/no replies. According to the thresholds in [55], the $\alpha$-reliability coefficients lay between fair and moderate agreement. Accuracy values were also acceptable for all questions. The grammatical structure (Q2) was the most controversial aspect of the first four questions due to different opinions on linguistic demand, as some annotators were more lenient with one of the languages.

Table X: Global inter-agreement metrics for yes/no questions.

	Q1	Q2	Q3	Q4
Accuracy	0.906	0.818	0.829	0.966
$\alpha$-reliability	0.296	0.242	0.442	0.491
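The agreement bands of [55] can be applied to the coefficients of Table X with a short Python helper. The band boundaries follow the Landis and Koch scale, while the function name `landis_koch_label` is hypothetical. The scale was originally proposed for the kappa statistic, so applying it to $\alpha$-reliability, as done here, is a convention rather than a strict equivalence.

```python
def landis_koch_label(coefficient):
    """Map an agreement coefficient to the Landis and Koch strength bands."""
    if coefficient < 0.0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if coefficient <= upper:
            return label
    raise ValueError("agreement coefficients cannot exceed 1.0")


# Alpha-reliability values copied from Table X.
TABLE_X_ALPHA = {"Q1": 0.296, "Q2": 0.242, "Q3": 0.442, "Q4": 0.491}
LABELS = {q: landis_koch_label(a) for q, a in TABLE_X_ALPHA.items()}
```

Under this mapping, Q1 and Q2 fall in the fair band and Q3 and Q4 in the moderate band.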

Table XI: Manual evaluation results for yes/no questions.

	Q1	Q2	Q3	Q4
Model	(‘yes’ %)	(‘yes’ %)	(‘no’ %)	(‘yes’ %)
Spanish GPT-2	91.6	90.8	77.6	96.4
Spanish BERT	95.2	91.6	85.6	98.0
English GPT-2	99.2	93.6	89.2	98.4
English BERT	96.4	89.2	84.8	94.8

The manual assessment scores for all four questions in Table XI were consistent with the automatic metrics: BERT was considered to perform better in Spanish and GPT-2 to perform better in English. The best outcome for Spanish language models was in word sense (Q4), whereas English language models scored best in word concordance (Q1). This is consistent with the fact that there is no gender concordance in English. For all language models, the most challenging question was word repetition (Q3).

Table XII: Manual evaluation results, general assessment question.

Model	Average	Norm. Average
Spanish GPT-2	3.682	0.920
Spanish BERT	4.124	1.034
English GPT-2	4.301	1.078
English BERT	3.870	0.967

For the more subjective fifth question, as the average rating from one annotator to another ranged from $3.529$ to $4.31$, we normalized the scores of each annotator to an average of 1.0. The results of Table XII indicate that the subjectively perceived quality of Spanish texts generated by BERT is higher than that of texts generated by GPT-2, and vice versa in the case of English, which is consistent with our initial intuition and all the results so far.
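One simple way to carry out such a normalization is to divide each rating by the mean rating of the annotator who gave it, as in the following Python sketch. Dividing by the annotator mean is an assumption about how the average of 1.0 was obtained, and the function name and the `(annotator, model, score)` record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def normalized_model_averages(ratings):
    """Average per-model ratings after rescaling each annotator to mean 1.0.

    ratings: list of (annotator, model, score) tuples.
    Returns a dict mapping each model to its normalized average.
    """
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    annotator_mean = {a: mean(s) for a, s in by_annotator.items()}

    by_model = defaultdict(list)
    for annotator, model, score in ratings:
        # Rescale so that every annotator's ratings average to 1.0.
        by_model[model].append(score / annotator_mean[annotator])
    return {model: mean(values) for model, values in by_model.items()}
```

This rescaling removes differences in how generous each annotator is overall while preserving the relative preferences each annotator expressed between models.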

VI Discussion

All the results of the previous section are aligned with our initial intuition that Spanish is more suited for non-causal language modeling than English. As shown in Table XIII, the predictability tests found natural English text to be more predictable than Spanish text given a causal context, by a relatively constant margin of $\sim$5%. However, given a non-causal context, Spanish was more predictable than English, by a margin that increased with context length.

In the automatic evaluation with conditional relative entropy, Spanish BERT showed the highest adherence to its target language grammar.

These results are consistent with the text generation ranking summarized in Table XIV (whose row “automatic evaluation” reflects the conditional relative entropies in Table IX), as English GPT-2 performed better than English BERT, and Spanish BERT better than Spanish GPT-2, in all the evaluation experiments. Manual evaluation consistently ranked English GPT-2 and Spanish BERT as the best language models for NLG. Spanish BERT ranked worse than English BERT in concordance, but, as previously stated, this result might be biased by the lack of gender concordance in English.

Overall, the results of our experiments show that non-causal language modeling is more promising for Spanish NLG than for English.

Table XIII: Predictability test, automatic evaluation results summary

		Highest predictability
Data set	Context length	$\,\overline{\!{H(\hat{X}\mid X_{c})}}^{-1}$	$\,\overline{\!{H(\hat{X}\mid X_{n}\setminus X_{c})}}^{-1}$
Tales	$N=2$	English ( $+5.13\%$ )	Spanish ( $+6.95\%$ )
Wikidumps	$N=2$	English ( $+7.05\%$ )	Spanish ( $+3.29\%$ )
	$N=3$	English ( $+5.92\%$ )	Spanish ( $+5.15\%$ )
	$N=4$	English ( $+4.99\%$ )	Spanish ( $+7.52\%$ )
	$N=5$	English ( $+4.85\%$ )	Spanish ( $+14.43\%$ )
	$N=6$	English ( $+4.45\%$ )	Spanish ( $+25.02\%$ )
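The Wikidumps margins in Table XIII appear to be consistent with the relative difference between the corresponding entropies in Table VIII. The following Python sketch checks this reading. The formula is inferred from the published values rather than stated in the text, and the names `TABLE_VIII` and `predictability_margins` are hypothetical.

```python
# (causal, non-causal excluding causal) entropies in bits per tag,
# copied from Table VIII for the Wikidumps data set.
TABLE_VIII = {
    2: {"Spanish": (2.4981, 1.9609), "English": (2.3335, 2.0255)},
    3: {"Spanish": (2.4258, 1.8313), "English": (2.2903, 1.9257)},
    4: {"Spanish": (2.3686, 1.7247), "English": (2.2559, 1.8544)},
    5: {"Spanish": (2.3412, 1.5679), "English": (2.2329, 1.7942)},
    6: {"Spanish": (2.2817, 1.3758), "English": (2.1846, 1.7200)},
}


def predictability_margins(table):
    """Return {N: (English causal margin %, Spanish non-causal margin %)}.

    Assumed formula: margin = (higher entropy / lower entropy - 1) * 100,
    credited to the language with the lower entropy.
    """
    margins = {}
    for n, row in table.items():
        es_causal, es_non_causal = row["Spanish"]
        en_causal, en_non_causal = row["English"]
        margins[n] = (
            (es_causal / en_causal - 1.0) * 100.0,          # English advantage
            (en_non_causal / es_non_causal - 1.0) * 100.0,  # Spanish advantage
        )
    return margins
```

Under this assumption the recomputed margins agree with Table XIII to within about 0.01 percentage points. The residual differences are what one would expect from recomputing with entropies rounded to four decimals.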

Table XIV: Text generation ranking, manual evaluation results summary

	Spanish		English
Experiment	GPT-2	BERT	GPT-2	BERT
Automatic evaluation	#3	#1	#2	#4
Q1. Concordance	#4	#3	#1	#2
Q2. Syntactic structure	#3	#2	#1	#4
Q3. Repetitions	#4	#2	#1	#3
Q4. Word sense	#3	#2	#1	#4
Q5. General rating	#4	#2	#1	#3

VII Conclusions

In this paper, we have first assessed English and Spanish predictability given causal and non-causal contexts, demonstrating that Spanish is more predictable given a non-causal context. For this purpose, we developed and computed a novel metric of the average causal and non-causal context-conditioned entropies of the grammatical categories present in similar and strictly parallel English and Spanish textual data sets. The experiments have shown that average causal context-conditioned entropy is higher in Spanish texts than in English texts, and that average non-causal context-conditioned entropy is higher in English texts than in Spanish ones. This was further supported by a study of the grammatical dependencies that are more predictable in each language and of how word location within a context influences predictability.

Following the validation of the hypothesis about the relation between causal and non-causal contexts and language predictability, we selected causal and non-causal language generators based on Spanish and English models to analytically assess their quality depending on the target language. To make the experiments comparable, we chose similarly dimensioned unidirectional and bidirectional pre-trained transformer language models and fine-tuned them using highly equivalent Spanish and English data sets.

Finally, we evaluated the outcome both analytically and manually to assess the performance of text generation in all test scenarios. In the first case, to compare the compliance of the language models with the grammatical structure of their target languages, we have proposed a conditional relative entropy metric. Manual evaluation, which was validated using inter-agreement metrics, is consistent with the automatic evaluation and thus supports it.

The insights of this study motivate further research into language predictability in languages other than English, as well as into efficient text production using bidirectional transformers in Spanish and other languages with similar grammatical structures.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems. MIT Press, 2017, pp. 1–11.
[2] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A Survey of Transformers,” AI Open, pp. 111–132, 2022.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” in Advances in Neural Information Processing Systems. MIT Press, 2014, pp. 1–9.
[4] I. Garrido-Muñoz, A. Montejo-Ráez, F. Martínez-Santiago, and L. A. Ureña-López, “A Survey on Bias in Deep NLP,” Applied Sciences, vol. 11, no. 7, pp. 1–3184, 2021.
[5] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, and J. Palomaki, “TyDiQA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 454–470, 2020.
[6] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual Denoising Pre-Training for Neural Machine Translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020.
[7] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021, pp. 483–498.
[8] S. Wu and M. Dredze, “Are All Languages Created Equal in Multilingual BERT?” in Proceedings of the Workshop on Representation Learning for NLP. Association for Computational Linguistics, 2020, pp. 120–130.
[9] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, and I. Gurevych, “How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 2021, pp. 3118–3135.
[10] E. Erdem, M. Kuyu, S. Yagcioglu, A. Frank, L. Parcalabescu, B. Plank, A. Babii, O. Turuta, A. Erdem, I. Calixto, E. Lloret, E.-S. Apostol, C.-O. Truică, B. Šandrih, S. Martinčić-Ipšić, G. Berend, A. Gatt, and G. Korvel, “Neural Natural Language Generation: A Survey on Multilinguality, Multimodality, Controllability and Learning,” Journal of Artificial Intelligence Research, vol. 73, pp. 1131–1207, 2022.
[11] M. Artetxe and H. Schwenk, “Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 597–610, 2019.
[12] J. H. Clark, D. Garrette, I. Turc, and J. Wieting, “Canine: Pre-Training an Efficient Tokenization-Free Encoder for Language Representation,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 73–91, 2022.
[13] R. Guarasci, S. Silvestri, G. D. Pietro, H. Fujita, and M. Esposito, “BERT Syntactic Transfer: A Computational Experiment on Italian, French and English Languages,” Computer Speech & Language, vol. 71, pp. 1–19, 2022.
[14] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems. MIT Press, 2020, pp. 1–25.
[15] L. D. Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim, and M. Guerini, “GePpeTto Carves Italian into a Language Model,” in Proceedings of the Italian Conference on Computational Linguistics, vol. 2769. Accademia University Press, 2020, pp. 136–143.
[16] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, C. Armentano-Oller, C. Rodriguez-Penagos, A. Gonzalez-Agirre, and M. Villegas, “MarIA: Spanish Language Models,” Procesamiento del Lenguaje Natural, vol. 68, no. 0, pp. 39–60, 2022.
[17] F. Davis and M. van Schijndel, “Recurrent Neural Network Language Models Always Learn English-Like Relative Clause Attachment,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020, pp. 1979–1990.
[18] K. Lahousse and B. Lamiroy, “Word Order in French, Spanish and Italian: A Grammaticalization Account,” Folia Linguistica, vol. 46, pp. 387–415, 2012.
[19] A. Assaiqeli, M. Maniam, and M. Farrah, “Inversion and Word Order in English: A Functional Perspective,” Studies in English Language and Education, vol. 8, pp. 523–545, 2021.
[20] K. Chen, R. Wang, M. Utiyama, E. Sumita, T. Zhao, M. Yang, and H. Zhao, “Towards more diverse input representation for neural machine translation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1586–1597, 2020.
[21] E. Wu, “Learning Accurate Integer Transformer Machine-Translation Models,” SN Computer Science, vol. 2, pp. 1–8, 2020.
[22] Y. Kawara, C. Chu, and Y. Arase, “Preordering encoding on transformer for translation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 644–655, 2021.
[23] T. Nguyen, L. Nguyen, P. Tran, and H. Nguyen, “Improving Transformer-Based Neural Machine Translation with Prior Alignments,” Complexity, vol. 2021, pp. 1–10, 2021.
[24] C. Zeng and S. Li, “Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets,” Wireless Communications and Mobile Computing, vol. 2021, pp. 1–17, 2021.
[25] Y. Bai, J. Yi, J. Tao, Z. Tian, Z. Wen, and S. Zhang, “Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1897–1911, 2021.
[26] N. Chen, S. Watanabe, J. Villalba, P. Zelasko, and N. Dehak, “Non-Autoregressive Transformer for Speech Recognition,” IEEE Signal Processing Letters, vol. 28, pp. 121–125, 2021.
[27] C. Wang, S. Dai, Y. Wang, F. Yang, M. Qiu, K. Chen, W. Zhou, and J. Huang, “ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1207–1218, 2022.
[28] M. Kaneko, “Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction,” Journal of Natural Language Processing, vol. 27, pp. 683–687, 2020.
[29] D. Park and C. W. Ahn, “Self-Supervised Contextual Data Augmentation for Natural Language Processing,” Symmetry, vol. 11, pp. 1–1393, 2019.
[30] V. Balaraman and B. Magnini, “Domain-aware dialogue state tracker for multi-domain dialogue systems,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 866–873, 2021.
[31] S. Yu, Y. Chen, and H. Zaidi, “AVA: A Financial Service Chatbot Based on Deep Bidirectional Transformers,” Frontiers in Applied Mathematics and Statistics, vol. 7, pp. 1–33, 2021.
[32] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási, “Limits of Predictability in Human Mobility,” Science, vol. 327, pp. 1018–1021, 2010.
[33] G. Li, V. L. Knoop, and H. van Lint, “Estimate the Limit of Predictability in Short-Term Traffic Forecasting: An Entropy-Based Approach,” Transportation Research Part C: Emerging Technologies, vol. 138, pp. 1–18, 2022.
[34] J. Hale, “A Probabilistic Earley Parser as a Psycholinguistic Model,” in Proceedings of the North American Chapter of the Association for Computational Linguistics on Language technologies. Association for Computational Linguistics, 2001, pp. 1–8.
[35] R. Levy, “Expectation-Based Syntactic Comprehension,” Cognition, vol. 106, pp. 1126–1177, 2008.
[36] M. W. Lowder, W. Choi, F. Ferreira, and J. M. Henderson, “Lexical Predictability During Natural Reading: Effects of Surprisal and Entropy Reduction,” Cognitive Science, vol. 42, pp. 1166–1183, 2018.
[37] J. M. Henderson, W. Choi, M. W. Lowder, and F. Ferreira, “Language Structure in the Brain: A Fixation-Related fMRI Study of Syntactic Surprisal in Reading,” NeuroImage, vol. 132, pp. 293–300, 2016.
[38] E. Gibson, “Linguistic Complexity: Locality of Syntactic Dependencies,” Cognition, vol. 68, pp. 1–76, 1998.
[39] R. L. Lewis and S. Vasishth, “An Activation-Based Model of Sentence Processing as Skilled Memory Retrieval,” Cognitive Science, vol. 29, pp. 375–419, 2005.
[40] B. Bartek, R. L. Lewis, S. Vasishth, and M. R. Smith, “In Search of On-Line Locality Effects in Sentence Comprehension,” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 37, pp. 1178–1198, 2011.
[41] B. Nicenboim, S. Vasishth, C. Gattei, M. Sigman, and R. Kliegl, “Working Memory Differences in Long-Distance Dependency Resolution,” Frontiers in Psychology, vol. 6, pp. 1–312, 2015.
[42] B. Nicenboim, P. Logačev, C. Gattei, and S. Vasishth, “When High-Capacity Readers Slow Down and Low-Capacity Readers Speed Up: Working Memory and Locality Effects,” Frontiers in Psychology, vol. 7, pp. 1–280, 2016.
[43] R. Futrell, E. Gibson, and R. P. Levy, “Lossy‐Context Surprisal: An Information‐Theoretic Model of Memory Effects in Sentence Processing,” Cognitive Science, vol. 44, pp. 1–54, 2020.
[44] S. Vasishth, K. Suckow, R. L. Lewis, and S. Kern, “Short-Term Forgetting in Sentence Comprehension: Crosslinguistic Evidence From Verb-Final Structures,” Language and Cognitive Processes, vol. 25, pp. 533–567, 2010.
[45] S. L. Frank, T. Trompenaars, and S. Vasishth, “Cross-Linguistic Differences in Processing Double-Embedded Relative Clauses: Working-Memory Constraints or Language Statistics?” Cognitive Science, vol. 40, pp. 554–578, 2016.
[46] J. Pater, “Generative Linguistics and Neural Networks at 60: Foundation, Friction, and Fusion,” Language, vol. 95, no. 1, pp. e41–e74, 2019.
[47] M. J. Hofmann, S. Remus, C. Biemann, R. Radach, and L. Kuchinke, “Language Models Explain Word Reading Times Better Than Empirical Predictability,” Frontiers in Artificial Intelligence, vol. 4, 2022.
[48] T. Linzen, “What Can Linguistics and Deep Learning Contribute to Each Other? Response to Pater,” Language, vol. 95, pp. e99–e108, 2019.
[49] K. Krippendorff, Content Analysis: An Introduction to Its Methodology. Sage Publications, Inc., 2012.
[50] C. H. Bjornsson, “Readability of Newspapers in 11 Languages,” Reading Research Quarterly, vol. 18, pp. 1–480, 1983.
[51] M. R. Montaño-Harmon, “Discourse Features of Written Mexican Spanish: Current Research in Contrastive Rhetoric and Its Implications,” Hispania, vol. 74, pp. 1–417, 1991.
[52] J. M. Simpson, “Topical Structure Analysis of Academic Paragraphs in English and Spanish,” Journal of Second Language Writing, vol. 9, pp. 293–309, 2000.
[53] A. Wang and K. Cho, “BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model,” in Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. Association for Computational Linguistics, 2019, pp. 30–36.
[54] T. Shen, V. Quach, R. Barzilay, and T. Jaakkola, “Blank Language Models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2020, pp. 5186–5198.
[55] J. R. Landis and G. G. Koch, “The Measurement of Observer Agreement for Categorical Data,” Biometrics, vol. 33, p. 159, Mar. 1977.

\EOD