Enhancing elusive clues in knowledge learning by contrasting attention of language models

Jian Gao, Xiao Zhang, Ji Wu, Miao Li
Abstract

Causal language models acquire a vast amount of knowledge from general text corpora during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from knowledge-dense and small-sized corpora. The deficiency can come from long-distance dependencies, which are hard for language models to capture, and from overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, the paper proposes a method to enhance knowledge learning during language model pretraining by enhancing elusive but important clues in text that are discovered by the language models themselves. We found that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues as a guide to perform token-dropout data augmentation on the training text, and observe a significant boost in both small and large models’ performance in fact memorization. This shows that the behavior contrast between more and less performant language models contains important clues for knowledge learning, and that it can be “amplified” for a straightforward improvement in knowledge learning efficiency.

Introduction

Pretrained large language models have shown impressive performance on a wide variety of downstream tasks (Ouyang et al. 2022; Chung et al. 2022; Touvron et al. 2023). To achieve good generalization, these models need to be trained on web-scale corpora that are diverse and large enough to capture the complexity of natural language. Unfortunately, when training corpora are limited in size or style variation, language models can struggle to generalize the information learned from them (Zhu and Li 2023). This deficiency poses a challenge for injecting knowledge into pretrained language models via continual pretraining (finetuning). In many domains, the available corpora are limited and knowledge-dense (e.g., textbooks, manuals, and documentation). Such domain text may be difficult to utilize effectively in finetuning, and language models may fail to generalize the domain knowledge to downstream domain tasks.

Little is known about the causes of this deficiency in knowledge learning. One likely cause is overfitting to co-occurrence patterns in the limited training text, which leads the model to learn spurious correlations instead of correct factual associations. Another possible reason is the difficulty of capturing long-range dependencies in text, which are crucial for understanding complex relationships. This difficulty is sometimes the result of an intentional design choice in the model architecture, such as the decay of attention weights in the RoPE (Su et al. 2024) positional encodings.

One possible route to understanding this phenomenon is via the attention module in language models. The attention mechanism is a key component that allows the model to focus on different parts of the input when making predictions. Attention weights have been shown to be interpretable and to help explain the model’s behavior (Clark et al. 2019).

Recently, Yüksekgönül et al. (2023) show that when predicting factual information, models are less likely to attend to the correct clue if the model does not know about the fact. This implies that for new knowledge unknown to the model, the model may not be able to attend to the correct clue at first, leading to difficulty in associating the correct clue (e.g., the head entity) with the prediction target (the tail entity).

To help language models learn, especially smaller models, a common approach is to use knowledge distillation (Hinton, Vinyals, and Dean 2015), also known as the teacher-student method, to transfer knowledge from a larger model. Given a learning goal, a more performant language model such as GPT-4 (OpenAI 2023) is often used to generate training data for the smaller model (Xu et al. 2024). A main drawback of this approach is that it requires the larger model to be already capable of the task or to already have the knowledge. This makes it unsuitable for learning novel knowledge, such as new facts from an evolving domain. Also, it can only help the smaller model learn and cannot help the larger model.

In this paper, we propose a simple method to enhance factual knowledge learning in continual pretraining, with the help of a pair of larger and smaller models. Our method is effective in learning novel facts and can boost the performance of both the larger and smaller models. The main contributions of the paper are as follows:

Attention difference between large and small language models reveals elusive but important clues in text. We show that while large and small language models both show high attention to important and obvious clues in text, large models pay significantly more attention than smaller models to important clues that are less obvious or elusive. Therefore, by contrasting the attention weights of large and small models, we can identify these elusive clues in text that are important for knowledge learning but are often easily overlooked.

Augmenting elusive clues in text boosts knowledge learning in continual pretraining. We show that by using the identified elusive clues as a guide, a token-dropout data augmentation that highlights the elusive clues can significantly boost the model’s performance in knowledge learning. We experiment on both synthetic and real-world corpora and show that the proposed method outperforms other forms of data augmentation, and that boosting elusive clues helps both the large and the small models.

Unlike previous work, which focused on distilling knowledge from large language models to make small language models perform better, our approach improves the performance of large language models on both synthetic and real-world datasets.

To the best of our knowledge, we are the first to analyze the attention discrepancies between large and small models and use them for data augmentation. Prior work has distilled attention patterns from large models to small models, but without analyzing what is being distilled. Unlike distillation, our approach also enhances the performance of large models, which is a novel contribution on our part.

We release the code and data used in this paper for reproducibility and further research (https://github.com/hushes-minutes/contrasting_attention).

Related Work

Attention as behavior explanation

It is observed that attention weights in transformer models provide interpretable clues about the model’s behavior. For example, attention heads within multi-head attention can spontaneously differentiate into distinct roles (Clark et al. 2019). Certain heads play a more significant role and affect performance significantly (Voita et al. 2019). More performant models tend to have attention weights that focus more on key information and features, a possible explanation of their superior performance (Yüksekgönül et al. 2023).

Some argue that while attention is somewhat interpretable, its interpretability is not an indicator of model performance (Serrano and Smith 2019). Opinion is divided on the extent to which attention weights reflect true model behavior (Jain and Wallace 2019; Wiegreffe and Pinter 2019). Our study extends these findings by comparing and contrasting the attention weights of different models, and shows that the difference between the attention weights of large and small models can provide important behavioral clues.

Data augmentation on text

Data augmentation is a critical technique for enhancing robustness and generalization, especially for limited-size datasets. Various data augmentation methods have been proposed, including random editing of sentences (Wei and Zou 2019) such as insertion, swapping, and deletion. Synonym replacement methods (Mosolova, Fomin, and Bondarenko 2018; Rizos, Hemker, and Schuller 2019) replace words with their synonyms. Contextual augmentation methods (Kobayashi 2018) replace words with other words predicted by a language model for semantic variations. Back-translation (Sennrich, Haddow, and Birch 2016; Edunov et al. 2018) is another commonly used method that generates augmented data by translating to and then back from another language. More sophisticated methods combine multiple augmentations (Xie et al. 2020; Karimi, Rossi, and Prati 2021).

Given that attention provides interpretable clues about the model’s behavior, Yu et al. (2022) and Hailemariam et al. (2023) use attention weights to find semantically significant words for replacement augmentation. Lewy and Mandziuk (2023) use attention weights to find significant input parts for mixup augmentation (Zhang et al. 2018). We go a step further and show that augmenting only the most significant words is insufficient for challenging knowledge learning scenarios, and that augmenting hard-to-notice but important parts of the input boosts the model’s performance even more than augmenting the significant parts.

Teacher-student methods for language models

To enhance the performance of smaller models, knowledge distillation methods have been extensively developed to transfer knowledge from larger models to smaller models (Hinton, Vinyals, and Dean 2015; Xu et al. 2024). Large pretrained language models can be used to generate data for finetuning smaller models, transferring their knowledge and skills, for example instruction following (Wang et al. 2023; Chiang et al. 2023) and reasoning ability (Fu et al. 2023; Ho, Schmid, and Yun 2023). Distillation from large models is also frequently used to build strong domain-specific or task-specific models with a compact size, for example for coding (Gunasekar et al. 2023; Rozière et al. 2023) and math (Luo et al. 2023; Yue et al. 2023). Our work explores a different way to utilize large models: we find the behavior difference between large and small models and use it to guide the models towards the more difficult parts of the text.

Continual pretraining of language models

Continual pretraining takes a language model pretrained on a general corpus and continues the pretraining process with a new corpus, typically domain-specific text, to enhance the model’s performance on domain tasks. Models acquire new knowledge and abilities via continual pretraining, for example in coding (Chen et al. 2021), math (Lewkowycz et al. 2022), and medicine (Singhal et al. 2023). We aim at learning new factual knowledge from text via continual pretraining, similar to Jang et al. (2022) and Zhu and Li (2023).

Problem setup: knowledge learning deficiency

Task: fact learning in (continual) pretraining

Language models can learn factual knowledge from pretraining (or continual pretraining) on text corpora. Zhu and Li (2023) introduced a synthetic biography dataset for evaluating the efficiency of knowledge learning in language models. The dataset has been utilized by Khalifa et al. (2024), Golovneva et al. (2024), and Saito et al. (2024). It consists of short synthetic biographies of individuals, with a fixed format shown in the following example:

Liam Thompson was born on January 5, 1990. He spent his early years in Melbourne, Australia. He received mentorship and guidance from faculty members at Sorbonne University. He completed his education with a focus on Biomedical Engineering. He had a professional role at the British Museum.

Each biography contains information about an individual’s name, birth date, birth city, education, and job. The task is to finetune a language model on the biographies (continual pretraining) so that it memorizes the factual information about the individuals. After training, the model is evaluated on a question-answering task, where we measure how accurately the model has memorized the underlined parts of the biographies (the birth date, birth city, university, major, and company).

The questions are formatted like “When was Liam Thompson born?”. Details on the training corpus and evaluation data are provided in Appendix A.

Deficiency in knowledge learning over long-range dependency

Zhu and Li (2023) have shown that training language models from scratch on the biographies yields poor performance in question answering. We instead perform continual pretraining on pretrained language models of up to 70 billion parameters. These language models have undergone extensive pretraining on massive corpora and show strong language capabilities.

We show that even pretrained models with billions of parameters struggle to memorize facts perfectly in continual pretraining. Table 1 shows that while Gemma 2 (Team et al. 2024) and LLaMA 3 (Dubey et al. 2024) memorize the first two pieces of information (birth date and birth city) with high accuracy, they struggle to memorize the following three pieces of information (university, major, and company). This rules out the possibility that the performance deficiency is due to limited model size or insufficient pretraining.

The performance trend on QA tasks is also plotted in Figure 1. As the relationship spans a longer distance (i.e., the distance between the tail entity, such as “Company”, and the head entity, the person’s name), the model’s performance shows a decreasing trend. This indicates that the model struggles to capture long-range dependencies in text, which are crucial for learning complex relationships.

Model          Metric   Birth date   Birth city   University   Major   Company
LLaMA 3 8B     EM       0.82         0.91         0.20         0.34    0.09
               F1       0.90         0.93         0.55         0.41    0.11
LLaMA 3 70B    EM       0.98         0.95         0.36         0.73    0.66
               F1       1.00         0.98         0.67         0.77    0.67
Gemma 2 2B     EM       0.98         0.99         0.12         0.54    0.15
               F1       0.98         0.99         0.40         0.57    0.18
Gemma 2 9B     EM       0.99         1.00         0.51         0.89    0.63
               F1       1.00         1.00         0.66         0.90    0.64
Table 1: Performance on the QA task after continual pretraining on the biography corpus.
Figure 1: Performance on the QA task shows a decreasing trend as the distance between the head and tail entities of the relationship increases in the training text.

One possible reason for the deficiency in learning long-range dependencies is overfitting to the large amount of distracting information between the head and tail entities in a relationship. Overfitting is more likely when a relationship occurs in only a few examples, as in the biography dataset. Another possible reason is an architectural bias that steers the model’s attention towards nearby information. Many popular models, such as LLaMA and Gemma, use the Rotary Position Embedding (RoPE) (Su et al. 2024) as positional encoding in their attention module. RoPE has a long-term decay property, which means that attention weights decay as the relative distance between the query and key tokens increases. This makes the model focus more on adjacent information at the cost of important information that is occasionally far away, hurting the model’s performance in learning long-range dependencies.
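The long-term decay can be illustrated numerically. The following is a minimal sketch, not the models' actual attention computation: it assumes the standard RoPE frequencies θ_i = base^(-2i/d) with base 10000 and, purely for illustration, a query and a key whose entries are all ones. Under that assumption the pre-softmax score at relative distance m reduces to the sum of 2·cos(m·θ_i) over the d/2 rotation pairs. The function names and the head dimension of 128 are hypothetical choices.

```python
import math


def rope_score(distance, dim=128, base=10000.0):
    """Pre-softmax RoPE score between an all-ones query and an all-ones key
    that are `distance` positions apart (illustrative inputs, not real states)."""
    total = 0.0
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)  # rotation frequency of the i-th 2-D pair
        # The pair (1, 1) dotted with itself rotated by angle phi equals 2*cos(phi).
        total += 2.0 * math.cos(distance * theta)
    return total


def mean_score(start, stop):
    """Average score over the relative distances start, ..., stop - 1."""
    return sum(rope_score(m) for m in range(start, stop)) / (stop - start)


# Print the average score over successively more distant windows.
for lo, hi in [(1, 17), (17, 65), (65, 257), (257, 513)]:
    print(f"distances {lo:>3} to {hi - 1:<3}: mean score {mean_score(lo, hi):7.2f}")
```

The score oscillates from one distance to the next, so the decay is best read as a trend over windows of distances rather than as a strictly decreasing curve.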

Analysis: contrasting attention of language models

We have shown that language models can achieve near-perfect accuracy in memorizing relationships that span a short distance in text, but struggle when they span a longer distance. In this section, we use attention weights as an interpretability tool to analyze the model’s behavior while learning long-range dependencies. We show that LLMs pay too little attention to key information that is located further away, and that more performant larger models pay more attention to this information than smaller models.

Attention weight visualization

We look at the model’s attention weights to answer the following question: what information does the model pay attention to when predicting the tail entities in a relationship? The model uses attention weights to retrieve hidden states of context tokens, so the weights determine the information flow from the context to the current token in text. Furthermore, if an incorrect head entity is attended to when predicting the tail entity during the forward pass, backpropagation will likely reinforce this incorrect association and cause the model to learn the wrong relationship.

To visualize the model’s attention weights when predicting the tail entities in a relationship, we extract the attention weights at the preposition token, i.e., the word immediately preceding the tail entity. For example, in the sentence “He received mentorship and guidance from faculty members at Sorbonne University”, the attention from the token “at” is extracted. Because the model is predicting the tail entity “Sorbonne University” at this position, the attention weights here likely correspond to the information necessary for predicting it. (To simplify the analysis, we average the attention weights across all layers and attention heads; this is discussed further in the Appendix.) To ease visualization and comparison, instead of directly showing the attention weights, we rank the tokens and visualize the top 10 tokens with the highest attention weights. For each model, we calculate the token attention ranking for 100 biographies and summarize the ranking using a bar plot in Figure 2. (Because attention paid to meaningless tokens provides little information, we remove periods, commas, spaces, and placeholders at the beginning of a sentence, for example <bos>.)
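The extraction and ranking steps described above can be sketched as follows. This is an illustrative sketch on toy data, not the released code: it assumes the attention weights are already available as a nested list indexed as [layer][head][query position][key position], and the function names and the toy numbers are hypothetical.

```python
def mean_attention_row(attn, query_pos):
    """Average the attention from `query_pos` to each visible context token
    across all layers and heads. `attn` is indexed [layer][head][query][key]."""
    n_ctx = query_pos + 1  # causal model: only positions <= query_pos are visible
    totals = [0.0] * n_ctx
    count = 0
    for layer in attn:
        for head in layer:
            row = head[query_pos]
            for j in range(n_ctx):
                totals[j] += row[j]
            count += 1
    return [t / count for t in totals]


# Tokens excluded from the ranking because they carry little information.
IGNORED = {".", ",", " ", "<bos>"}


def top_attended(tokens, weights, top_k=10, ignored=IGNORED):
    """Return up to `top_k` (token, weight) pairs, highest weight first."""
    scored = [(tok, w) for tok, w in zip(tokens, weights) if tok not in ignored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy example: 2 layers x 1 head, each a 4x4 causal attention matrix.
tokens = ["<bos>", "Liam", "role", "at"]
layer_a = [[[1.0, 0.0, 0.0, 0.0],
            [0.6, 0.4, 0.0, 0.0],
            [0.5, 0.2, 0.3, 0.0],
            [0.5, 0.1, 0.3, 0.1]]]
layer_b = [[[1.0, 0.0, 0.0, 0.0],
            [0.7, 0.3, 0.0, 0.0],
            [0.4, 0.4, 0.2, 0.0],
            [0.3, 0.3, 0.3, 0.1]]]
attn = [layer_a, layer_b]

# Attention from the token "at" (position 3), the word preceding the tail entity.
weights = mean_attention_row(attn, query_pos=3)
print(top_attended(tokens, weights))
```

Averaging over every layer and head is the simplification stated in the text; a per-head analysis would keep the two inner loops separate.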

Figure 2: Visualization of tokens receiving the highest attention weights, at the preposition just before the “company” field. Panels: (a) Gemma 2 2B, (b) LLaMA 3 8B, (c) Gemma 2 9B, (d) LLaMA 3 70B. Tokens in a sentence are ranked by attention weight, from large to small. Each bar in the graph shows the composition of the i-th ranked token across 100 biographies. “⟨...⟩” denotes tokens belonging to the information fields, and all others are individual tokens. All models pay the most attention to the relationship words (e.g., “professional”, “role”, “at”), then to distracting entities in between (e.g., birth date, city, etc.). Even the larger models pay only a small amount of attention to the true head entity (name).

Results show that models assign the most attention to the most important information for predicting the tail entity: the relationship words. The models also pay much attention to the distracting entities in the preceding text. The correct head entity, which is the key information for predicting the tail entity, receives hardly any attention from smaller models and only a small amount of attention from larger models such as Gemma 2 9B, and is almost never ranked among the top tokens. This indicates that the model’s attention is biased towards short-distance information, which may lead to the model learning an incorrect association and overfitting to such spurious co-occurrences.

Contrasting attention of large and small language models

Compared with smaller models, larger language models tend to have better overall language understanding capabilities, and are therefore more likely to pay attention to the correct clue in the text. Within the same family of models, for example the LLaMA 3 8B and 70B models, the training corpus, model architecture, and training procedure are mostly similar, so the models should have relatively similar general behavior patterns apart from their capability differences.

Therefore, we can contrast the attention pattern between a large and a small model in the same family to identify the difference in the clues they pay attention to. In Figure 3, we subtract the attention weights of the small model from those of the large model, and visualize the top 10 tokens with the largest attention differences. The graph shows the tokens receiving the most “additional” attention from the large model. It is clear that the correct head entity of the relationship, the “name” tokens (in red), often receives the most additional attention. (The date tokens, in blue, also appear to rank high in attention differences. This is likely because there are on average more date tokens than name tokens in the text, so they are counted more frequently in the top 10 tokens. For example, under the LLaMA 3 tokenizer, the name is split into an average of 3.56 tokens, while the date is split into around 7 tokens.)
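The contrast itself is a per-token subtraction followed by a sort. The sketch below assumes that both models share a tokenizer (as models in the same family do), so that their averaged attention weights align token by token; the function names and the toy weights are hypothetical and only illustrate the computation.

```python
def attention_difference(large_weights, small_weights):
    """Per-token difference: large-model attention minus small-model attention."""
    if len(large_weights) != len(small_weights):
        raise ValueError("both models must score the same token sequence")
    return [l - s for l, s in zip(large_weights, small_weights)]


def rank_by_difference(tokens, large_weights, small_weights, top_k=10):
    """Return up to `top_k` (token, difference) pairs with the largest
    additional attention from the large model, largest first."""
    diffs = attention_difference(large_weights, small_weights)
    order = sorted(range(len(tokens)), key=lambda i: diffs[i], reverse=True)
    return [(tokens[i], diffs[i]) for i in order[:top_k]]


# Toy weights for the same six context tokens under a large and a small model.
tokens = ["Liam", "Thompson", "Melbourne", "professional", "role", "at"]
large = [0.10, 0.08, 0.12, 0.30, 0.25, 0.15]
small = [0.02, 0.03, 0.15, 0.33, 0.29, 0.18]

for token, diff in rank_by_difference(tokens, large, small, top_k=3):
    print(f"{token:<12} {diff:+.2f}")
```

In this toy case both models rank the relationship words highest in absolute terms, yet the name tokens come first once the small model's weights are subtracted, which is the pattern the contrast is meant to surface.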

Comparing the original model attention in Figure 2 with the attention difference in Figure 3, we can see that while larger models pay more attention to the correct clue in text, the absolute attention weight on the correct clue is still small, and attention remains biased towards the closer distracting entities. This calls for a method to “amplify” the attention differences so that the model can focus even more on the correct clue in text.

Figure 3: Visualization of tokens receiving the highest additional attention weights from the large model compared to the small model. Panels: (a) Gemma 2 9B/2B, (b) LLaMA 3 70B/8B. For example, the 9B/2B graph visualizes the distribution of the top 10 tokens with the largest attention_weight(Gemma 2 9B) - attention_weight(Gemma 2 2B) values. The name tokens (in red), the correct head entity, receive significant additional attention from the larger model.

Method: augmentation from contrasting attention

We have shown that important clues that are hard to notice in text can be discovered from the attention difference between large and small models. Next, we propose to utilize and amplify these clues by combining them with a simple dropout data augmentation method.

Token-dropout data augmentation

To combat overfitting, token-dropout data augmentation is a simple and effective technique that randomly drops out tokens in a training example (Wei and Zou 2019). Token-dropout introduces noise to the training data and breaks the model’s reliance on spurious co-occurrences in the training examples, helping the model achieve better generalization. A naive token-dropout randomly deletes each token independently with a probability $\alpha$.
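A minimal sketch of naive token-dropout follows. It operates on a list of tokens and deletes each one independently with probability α; the function name, the whitespace tokenization in the usage example, and the fixed seed are illustrative assumptions rather than details from the paper.

```python
import random


def random_token_dropout(tokens, alpha, rng=None):
    """Delete each token independently with probability `alpha`."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = rng or random.Random()
    # rng.random() is uniform on [0, 1), so a token is kept with probability 1 - alpha.
    return [tok for tok in tokens if rng.random() >= alpha]


# Whitespace tokenization is used here only to keep the example readable.
sentence = "He had a professional role at the British Museum ."
tokens = sentence.split()
print(random_token_dropout(tokens, alpha=0.6, rng=random.Random(0)))
```

Because every token is treated alike, distant head entities are as likely to be removed as nearby distractors.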

Augmentation guided by elusive clues

Figure 4: Overview of the proposed data augmentation method based on attention difference between large and small models. Color represents retain probability of each token.

Although naive token-dropout mitigates overfitting, it does not solve the long-range dependency learning problem. As each token is dropped out independently, the model still pays too little attention to non-obvious and distant information. We propose to use the attention difference between large and small models as a guide to drop out tokens more selectively. We first use the attention difference to rank the tokens in the training data, and then drop out each token with a probability that grows with its rank, so that tokens with the largest attention differences are dropped least often. In this fashion, the model is encouraged to focus more on the tokens containing important but elusive information, as identified by the attention difference.

We use the following function to calculate dropout probability for each token:

$$p(r) = \alpha\left(1 - e^{-\beta r}\right) \tag{1}$$

The token with the $r$-th rank (having the $r$-th largest attention difference) will be dropped out with probability $p(r)$. (A graph of the function is shown in Figure 5.) The hyperparameter $\beta$ controls how fast the dropout probability increases with the ranking, and $\alpha$ controls the maximum dropout probability. The tokens with higher attention differences will have lower dropout probabilities, encouraging the model to focus more on these tokens. Figure 4 illustrates the process of the proposed augmentation method.
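Equation 1 and the guided dropout can be sketched in a few lines. This is a hedged illustration rather than the released implementation: it assumes ranks start at 1 for the token with the largest attention difference, that ties are broken by position, and that the attention differences are supplied as one score per token; the function names and toy scores are hypothetical.

```python
import math
import random


def dropout_probability(rank, alpha, beta):
    """Equation 1: p(r) = alpha * (1 - exp(-beta * r))."""
    return alpha * (1.0 - math.exp(-beta * rank))


def guided_token_dropout(tokens, scores, alpha, beta, rng=None):
    """Drop tokens with a probability that grows with their rank, where rank 1
    (assumed 1-based) is the token with the largest score."""
    rng = rng or random.Random()
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    rank_of = {idx: rank for rank, idx in enumerate(order, start=1)}
    kept = []
    for i, tok in enumerate(tokens):
        if rng.random() >= dropout_probability(rank_of[i], alpha, beta):
            kept.append(tok)
    return kept


# Dropout probability at a few ranks for alpha = 0.6 and beta = 0.03.
for r in (1, 10, 50):
    print(f"rank {r:>2}: p = {dropout_probability(r, 0.6, 0.03):.3f}")

# Toy attention differences: the name tokens have the largest scores.
tokens = ["Liam", "Thompson", "had", "a", "role", "at"]
scores = [0.08, 0.05, -0.01, -0.02, -0.04, -0.03]
print(guided_token_dropout(tokens, scores, alpha=0.6, beta=0.03, rng=random.Random(0)))
```

With α = 0.6 and β = 0.03, the top-ranked token is dropped with probability of about 0.018, the 10th with about 0.156, and the 50th with about 0.466, approaching α for very low-ranked tokens.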

Results

The biography dataset

We use low-rank adaptation (LoRA) (Hu et al. 2022) to facilitate finetuning of models of up to 70 billion parameters. As the corpus size is limited, we use a rank of 16 for the LoRA adapters. Adapters are added to all of the model’s weights except for the embedding and the output layer. We finetune models with the Hugging Face Transformers library (Wolf et al. 2020) on NVIDIA 4090 GPUs. We experiment with LLaMA 3 (Dubey et al. 2024) and Gemma 2 (Team et al. 2024) as two families of language models.

For the baselines, we compare the performance of the models after plain finetuning, random (naive) token-dropout, and token-dropout by attention. Unlike random dropout, dropout by attention uses the model’s original attention weights to guide the dropout probabilities, on the assumption that the model puts more attention on tokens it deems important. Tokens with lower attention weights are dropped out with higher probabilities to enhance the important information, in a similar vein to Yu et al. (2022) and Hailemariam et al. (2023). The dropout probabilities are also calculated using Equation 1.

                                       Hyperparameters      University       Company
Method                                 $\alpha$   $\beta$   EM      F1       EM      F1
Gemma 2 2B
  Baselines
    Plain finetuning                   -          -         0.17    0.48     0.18    0.21
    Random token-dropout               0.6        -         0.07    0.38     0.21    0.23
    Token-dropout by attention         0.6        0.05      0.19    0.51     0.23    0.29
  Ours
    Token-dropout by attention diff    0.6        0.03      0.25    0.56     0.32    0.36
Gemma 2 9B
  Baselines
    Plain finetuning                   -          -         0.61    0.78     0.63    0.64
    Random token-dropout               0.7        -         0.52    0.73     0.51    0.57
    Token-dropout by attention         0.6        0.05      0.49    0.62     0.44    0.47
  Ours
    Token-dropout by attention diff    0.6        0.03      0.84    0.92     0.90    0.92
LLaMA 3 8B
  Baselines
    Plain finetuning                   -          -         0.30    0.55     0.17    0.21
    Random token-dropout               0.6        -         0.11    0.49     0.24    0.29
    Token-dropout by attention         0.6        0.05      0.24    0.62     0.21    0.28
  Ours
    Token-dropout by attention diff    0.7        0.05      0.29    0.64     0.42    0.53
LLaMA 3 70B
  Baselines
    Plain finetuning                   -          -         0.42    0.69     0.66    0.67
    Random token-dropout               0.6        -         0.71    0.86     0.71    0.78
    Token-dropout by attention         0.7        0.05      0.51    0.75     0.61    0.68
  Ours
    Token-dropout by attention diff    0.7        0.01      0.90    0.96     0.96    0.96
Table 2: QA performance after continual pretraining on the biography corpus. Data augmentation based on attention difference significantly outperforms other data augmentation methods, for both small and large models.

For each experiment, we trained the model for 10 to 30 epochs with learning rates in [5e-5, 1e-3] and selected the model with the best performance. For the augmentation-based methods, we also searched for the best hyperparameters $\alpha$ and $\beta$ individually for each method. Interestingly, the best hyperparameters for the dropout probabilities happen to be similar for different models and augmentation methods. For each of the augmentation methods, we generate 10 augmented versions of each training example and combine them with the original examples.
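The corpus construction in the last sentence can be sketched as below. The sketch assumes each example is a list of tokens and takes the augmentation as a callable; the function name, the seed handling, and the ordering of the output (each original followed by its augmented versions) are hypothetical choices, since the text does not specify how the examples are combined or shuffled.

```python
import random


def build_augmented_corpus(examples, augment_fn, n_versions=10, seed=0):
    """Return every original example followed by `n_versions` augmented copies.

    `augment_fn(tokens, rng)` must return a new token list for one example.
    """
    rng = random.Random(seed)
    corpus = []
    for tokens in examples:
        corpus.append(list(tokens))  # keep the original example unchanged
        for _ in range(n_versions):
            corpus.append(augment_fn(tokens, rng))
    return corpus


def drop_60_percent(tokens, rng):
    """Stand-in augmentation: naive token-dropout with probability 0.6."""
    return [tok for tok in tokens if rng.random() >= 0.6]


examples = [
    "Liam Thompson was born on January 5 , 1990 .".split(),
    "He had a professional role at the British Museum .".split(),
]
corpus = build_augmented_corpus(examples, drop_60_percent)
print(len(corpus))  # 2 originals + 2 * 10 augmented versions
```

Any of the dropout variants compared in Table 2 could be passed as `augment_fn`, provided it follows the same two-argument signature.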

Results in Table 2 show that the proposed token-dropout augmentation based on attention difference significantly outperforms other data augmentation methods. We report QA accuracy on the “university” and the “company” fields, as the models have poor performance on these fields under plain finetuning (Table 1). We report exact match (EM) accuracy and normalized word-level F1 scores. While random dropout and dropout by attention improve performance over no data augmentation in some settings, our method achieves a much larger improvement. This indicates that contrasting the attention of large and small language models does find important but elusive clues in text, and that amplifying these clues in the input has immediate positive effects on the model’s memorization efficiency, even for the largest 70B model.
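For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles, collapsing whitespace); the exact normalization used in the experiments is not specified in the text, so this is an assumption, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Word-level F1 between the normalized prediction and reference."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the British Museum.", "British Museum"))
print(f1_score("Sorbonne", "Sorbonne University"))
```

A partially correct answer such as “Sorbonne” for “Sorbonne University” scores zero on EM but receives partial credit on F1, which is why the two numbers can differ considerably for multi-word fields.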

Real-world dataset

Aside from the biography dataset, we also evaluate the proposed method on Wikipedia text to verify whether the method helps knowledge learning on general text. Specifically, we evaluate on the Paragraph-Level Wikipedia Question-Answering dataset (Du and Cardie 2018). We first perform continual pretraining on the Wikipedia text paragraphs (included in the dataset), then evaluate the model’s performance on the question-answering data. (This is the “closed-book” setting, where the model is not allowed to look at the original Wikipedia passage during question answering. It tests the model’s ability to memorize factual knowledge during the continual pretraining phase.) The questions are specifically designed to incorporate coreference dependencies that span multiple sentences in a paragraph, making it a challenging task that tests the model’s ability to learn and memorize complex factual associations.

An example paragraph of Wikipedia text from the dataset is as follows:

The 2005 edition of the International ISBN Agency’s official manual describes how the 13-digit ISBN check digit is calculated. The ISBN-13 check digit, which is the last digit of the ISBN, must range from 0 to 9 and must be such that the sum of all the thirteen digits, each multiplied by its (integer) weight, alternating between 1 and 3, is a multiple of 10.

An example of the question from the dataset is as follows:

Question: How many digits does the ISBN have?
Answer: 13

                                       Hyperparameters      QA performance
Method                                 $\alpha$   $\beta$   EM       F1
Gemma 2 2B
  Baselines
    Plain finetuning                   -          -         0.126    0.215
    Random token-dropout               0.7        -         0.12     0.2233
    Token-dropout by attention         0.7        0.005     0.145    0.249
  Ours
    Token-dropout by attention diff    0.7        0.005     0.156    0.256
Gemma 2 9B
  Baselines
    Plain finetuning                   -          -         0.186    0.2872
    Random token-dropout               0.7        -         0.198    0.314
    Token-dropout by attention         0.7        0.005     0.205    0.315
  Ours
    Token-dropout by attention diff    0.7        0.005     0.231    0.334
LLaMA 3 8B
  Baselines
    Plain finetuning                   -          -         0.146    0.228
    Random token-dropout               0.7        -         0.067    0.159
    Token-dropout by attention         0.7        0.005     0.134    0.239
  Ours
    Token-dropout by attention diff    0.7        0.03      0.172    0.263
LLaMA 3 70B
  Baselines
    Plain finetuning                   -          -         0.179    0.282
    Random token-dropout               0.7        -         0.187    0.307
    Token-dropout by attention         0.7        0.005     0.190    0.288
  Ours
    Token-dropout by attention diff    0.7        0.005     0.212    0.308
Table 3: QA performance after continual pretraining on the Wikipedia corpus. Data augmentation based on attention difference outperforms other data augmentation methods.

Results in Table 3 show that the proposed method also improves knowledge learning from the Wikipedia text. Naive data augmentation can negatively affect the model’s performance, while our method improves the model’s memorization efficiency by selectively amplifying difficult and elusive clues. This shows that enhancing the model’s focus on important but elusive information is a crucial factor in improving the model’s knowledge learning efficiency in pretraining, and that our method is generally applicable to different kinds of text.

Conclusion

Efficiency of learning factual knowledge is not only crucial for pretraining, but also important for effective continual and lifelong learning in language models. Due to overfitting and the long-range dependency problem, even performant language models can struggle to learn and memorize factual knowledge from limited data. In this work, we show that one of the key factors in improving the model’s learning, finding the “elusive” but important clues in text, is already embedded in the model’s attention weights. Such clues are hard for the model to discover by itself because of its bias towards short-range contexts, but they clearly manifest themselves when contrasting the attention of a larger and a smaller model. Based on this discovery, we propose a simple yet effective data augmentation method that leverages the attention difference to guide the dropout of tokens in the input. Our method significantly improves the model’s performance in memorizing factual knowledge, and is shown to be effective for different corpora and models.

References

  • Chen et al. (2021) Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H. P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; Ray, A.; Puri, R.; Krueger, G.; Petrov, M.; Khlaaf, H.; Sastry, G.; Mishkin, P.; Chan, B.; Gray, S.; Ryder, N.; Pavlov, M.; Power, A.; Kaiser, L.; Bavarian, M.; Winter, C.; Tillet, P.; Such, F. P.; Cummings, D.; Plappert, M.; Chantzis, F.; Barnes, E.; Herbert-Voss, A.; Guss, W. H.; Nichol, A.; Paino, A.; Tezak, N.; Tang, J.; Babuschkin, I.; Balaji, S.; Jain, S.; Saunders, W.; Hesse, C.; Carr, A. N.; Leike, J.; Achiam, J.; Misra, V.; Morikawa, E.; Radford, A.; Knight, M.; Brundage, M.; Murati, M.; Mayer, K.; Welinder, P.; McGrew, B.; Amodei, D.; McCandlish, S.; Sutskever, I.; and Zaremba, W. 2021. Evaluating Large Language Models Trained on Code. CoRR, abs/2107.03374.
  • Chiang et al. (2023) Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
  • Chung et al. (2022) Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; Webson, A.; Gu, S. S.; Dai, Z.; Suzgun, M.; Chen, X.; Chowdhery, A.; Narang, S.; Mishra, G.; Yu, A.; Zhao, V. Y.; Huang, Y.; Dai, A. M.; Yu, H.; Petrov, S.; Chi, E. H.; Dean, J.; Devlin, J.; Roberts, A.; Zhou, D.; Le, Q. V.; and Wei, J. 2022. Scaling Instruction-Finetuned Language Models. CoRR, abs/2210.11416.
  • Clark et al. (2019) Clark, K.; Khandelwal, U.; Levy, O.; and Manning, C. D. 2019. What Does BERT Look At? An Analysis of BERT’s Attention. CoRR, abs/1906.04341.
  • Du and Cardie (2018) Du, X.; and Cardie, C. 2018. Harvesting Paragraph-level Question-Answer Pairs from Wikipedia. In Gurevych, I.; and Miyao, Y., eds., Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1907–1917. Melbourne, Australia: Association for Computational Linguistics.
  • Dubey et al. (2024) Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
  • Edunov et al. (2018) Edunov, S.; Ott, M.; Auli, M.; and Grangier, D. 2018. Understanding Back-Translation at Scale. In Riloff, E.; Chiang, D.; Hockenmaier, J.; and Tsujii, J., eds., Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, 489–500. Association for Computational Linguistics.
  • Fu et al. (2023) Fu, Y.; Peng, H.; Ou, L.; Sabharwal, A.; and Khot, T. 2023. Specializing Smaller Language Models towards Multi-Step Reasoning. In Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; and Scarlett, J., eds., International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, 10421–10430. PMLR.
  • Golovneva et al. (2024) Golovneva, O.; Allen-Zhu, Z.; Weston, J.; and Sukhbaatar, S. 2024. Reverse Training to Nurse the Reversal Curse. ArXiv, abs/2403.13799.
  • Gunasekar et al. (2023) Gunasekar, S.; Zhang, Y.; Aneja, J.; Mendes, C. C. T.; Giorno, A. D.; Gopi, S.; Javaheripi, M.; Kauffmann, P.; de Rosa, G.; Saarikivi, O.; Salim, A.; Shah, S.; Behl, H. S.; Wang, X.; Bubeck, S.; Eldan, R.; Kalai, A. T.; Lee, Y. T.; and Li, Y. 2023. Textbooks Are All You Need. CoRR, abs/2306.11644.
  • Hailemariam et al. (2023) Hailemariam, M. Y.; Lynden, S. J.; Matono, A.; and Amagasa, T. 2023. Self-Attention-based Data Augmentation Method for Text Classification. In Proceedings of the 15th International Conference on Machine Learning and Computing, ICMLC 2023, Zhuhai, China, February 17-20, 2023, 239–244. ACM.
  • Hinton, Vinyals, and Dean (2015) Hinton, G. E.; Vinyals, O.; and Dean, J. 2015. Distilling the Knowledge in a Neural Network. CoRR, abs/1503.02531.
  • Ho, Schmid, and Yun (2023) Ho, N.; Schmid, L.; and Yun, S. 2023. Large Language Models Are Reasoning Teachers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, 14852–14882. Association for Computational Linguistics.
  • Hu et al. (2022) Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net.
  • Jain and Wallace (2019) Jain, S.; and Wallace, B. C. 2019. Attention is not Explanation. In Burstein, J.; Doran, C.; and Solorio, T., eds., Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 3543–3556. Association for Computational Linguistics.
  • Jang et al. (2022) Jang, J.; Ye, S.; Yang, S.; Shin, J.; Han, J.; Kim, G.; Choi, S. J.; and Seo, M. 2022. Towards Continual Knowledge Learning of Language Models. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net.
  • Karimi, Rossi, and Prati (2021) Karimi, A.; Rossi, L.; and Prati, A. 2021. AEDA: An Easier Data Augmentation Technique for Text Classification. In Moens, M.; Huang, X.; Specia, L.; and Yih, S. W., eds., Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, 2748–2754. Association for Computational Linguistics.
  • Khalifa et al. (2024) Khalifa, M.; Wadden, D.; Strubell, E.; Lee, H.; Wang, L.; Beltagy, I.; and Peng, H. 2024. Source-Aware Training Enables Knowledge Attribution in Language Models. ArXiv, abs/2404.01019.
  • Kobayashi (2018) Kobayashi, S. 2018. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. In Walker, M. A.; Ji, H.; and Stent, A., eds., Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), 452–457. Association for Computational Linguistics.
  • Lewkowycz et al. (2022) Lewkowycz, A.; Andreassen, A.; Dohan, D.; Dyer, E.; Michalewski, H.; Ramasesh, V. V.; Slone, A.; Anil, C.; Schlag, I.; Gutman-Solo, T.; Wu, Y.; Neyshabur, B.; Gur-Ari, G.; and Misra, V. 2022. Solving Quantitative Reasoning Problems with Language Models. In NeurIPS.
  • Lewy and Mandziuk (2023) Lewy, D.; and Mandziuk, J. 2023. AttentionMix: Data augmentation method that relies on BERT attention mechanism. CoRR, abs/2309.11104.
  • Luo et al. (2023) Luo, H.; Sun, Q.; Xu, C.; Zhao, P.; Lou, J.; Tao, C.; Geng, X.; Lin, Q.; Chen, S.; and Zhang, D. 2023. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. CoRR, abs/2308.09583.
  • Mosolova, Fomin, and Bondarenko (2018) Mosolova, A.; Fomin, V.; and Bondarenko, I. 2018. Text Augmentation for Neural Networks. In van der Aalst, W. M. P.; Batagelj, V.; Glavas, G.; Ignatov, D. I.; Khachay, M. Y.; Koltsova, O.; Kuznetsov, S. O.; Lomazova, I. A.; Loukachevitch, N. V.; Napoli, A.; Panchenko, A.; Pardalos, P. M.; Pelillo, M.; and Savchenko, A. V., eds., Supplementary Proceedings of the Seventh International Conference on Analysis of Images, Social Networks and Texts (AIST 2018), Moscow, Russia, July 5 - 7, 2018, volume 2268 of CEUR Workshop Proceedings, 104–109. CEUR-WS.org.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. Technical report.
  • Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C. L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; Schulman, J.; Hilton, J.; Kelton, F.; Miller, L.; Simens, M.; Askell, A.; Welinder, P.; Christiano, P. F.; Leike, J.; and Lowe, R. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
  • Rajpurkar et al. (2016) Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2383–2392. Association for Computational Linguistics.
  • Rizos, Hemker, and Schuller (2019) Rizos, G.; Hemker, K.; and Schuller, B. W. 2019. Augment to Prevent: Short-Text Data Augmentation in Deep Learning for Hate-Speech Classification. In Zhu, W.; Tao, D.; Cheng, X.; Cui, P.; Rundensteiner, E. A.; Carmel, D.; He, Q.; and Yu, J. X., eds., Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, 991–1000. ACM.
  • Rozière et al. (2023) Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X. E.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; Kozhevnikov, A.; Evtimov, I.; Bitton, J.; Bhatt, M.; Canton-Ferrer, C.; Grattafiori, A.; Xiong, W.; Défossez, A.; Copet, J.; Azhar, F.; Touvron, H.; Martin, L.; Usunier, N.; Scialom, T.; and Synnaeve, G. 2023. Code Llama: Open Foundation Models for Code. CoRR, abs/2308.12950.
  • Saito et al. (2024) Saito, K.; Sohn, K.; Lee, C.-Y.; and Ushiku, Y. 2024. Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction.
  • Sennrich, Haddow, and Birch (2016) Sennrich, R.; Haddow, B.; and Birch, A. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Serrano and Smith (2019) Serrano, S.; and Smith, N. A. 2019. Is Attention Interpretable? In Korhonen, A.; Traum, D. R.; and Màrquez, L., eds., Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, 2931–2951. Association for Computational Linguistics.
  • Singhal et al. (2023) Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; Schaekermann, M.; Wang, A.; Amin, M.; Lachgar, S.; Mansfield, P. A.; Prakash, S.; Green, B.; Dominowska, E.; y Arcas, B. A.; Tomasev, N.; Liu, Y.; Wong, R.; Semturs, C.; Mahdavi, S. S.; Barral, J. K.; Webster, D. R.; Corrado, G. S.; Matias, Y.; Azizi, S.; Karthikesalingam, A.; and Natarajan, V. 2023. Towards Expert-Level Medical Question Answering with Large Language Models. CoRR, abs/2305.09617.
  • Su et al. (2024) Su, J.; Ahmed, M. H. M.; Lu, Y.; Pan, S.; Bo, W.; and Liu, Y. 2024. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 568: 127063.
  • Team et al. (2024) Team, G.; Riviere, M.; Pathak, S.; Sessa, P. G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
  • Touvron et al. (2023) Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR, abs/2302.13971.
  • Voita et al. (2019) Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; and Titov, I. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. CoRR, abs/1905.09418.
  • Wang et al. (2023) Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; and Hajishirzi, H. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Rogers, A.; Boyd-Graber, J. L.; and Okazaki, N., eds., Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 13484–13508. Association for Computational Linguistics.
  • Wei and Zou (2019) Wei, J. W.; and Zou, K. 2019. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 6381–6387. Association for Computational Linguistics.
  • Wiegreffe and Pinter (2019) Wiegreffe, S.; and Pinter, Y. 2019. Attention is not not Explanation. In Inui, K.; Jiang, J.; Ng, V.; and Wan, X., eds., Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 11–20. Association for Computational Linguistics.
  • Wolf et al. (2020) Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, 38–45. Association for Computational Linguistics.
  • Xie et al. (2020) Xie, Q.; Dai, Z.; Hovy, E. H.; Luong, T.; and Le, Q. 2020. Unsupervised Data Augmentation for Consistency Training. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Xu et al. (2024) Xu, X.; Li, M.; Tao, C.; Shen, T.; Cheng, R.; Li, J.; Xu, C.; Tao, D.; and Zhou, T. 2024. A Survey on Knowledge Distillation of Large Language Models. CoRR, abs/2402.13116.
  • Yu et al. (2022) Yu, Y. J.; Yoon, S. J.; Jun, S. Y.; and Kim, J. W. 2022. TABAS: Text augmentation based on attention score for text classification model. ICT Express, 8(4): 549–554.
  • Yue et al. (2023) Yue, X.; Qu, X.; Zhang, G.; Fu, Y.; Huang, W.; Sun, H.; Su, Y.; and Chen, W. 2023. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. CoRR, abs/2309.05653.
  • Yüksekgönül et al. (2023) Yüksekgönül, M.; Chandrasekaran, V.; Jones, E.; Gunasekar, S.; Naik, R.; Palangi, H.; Kamar, E.; and Nushi, B. 2023. Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models. CoRR, abs/2309.15098.
  • Zhang et al. (2018) Zhang, H.; Cissé, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018. mixup: Beyond Empirical Risk Minimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Zhu and Li (2023) Zhu, Z. A.; and Li, Y. 2023. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. CoRR, abs/2309.14316.

Appendix A Data details

Synthetic dataset

The synthetic biography dataset is proposed and used in Zhu and Li (2023). The format of data is similar to Wikipedia passages. It contains multiple forms of information such as birth date, birth place, university and major. The dataset was not directly published by its original authors, and we reconstructed it based on the details and descriptions in their work. The dataset is also used by various related works, such as Khalifa et al. (2024), Golovneva et al. (2024), and Saito et al. (2024).

We used GPT-4 (OpenAI 2023) to generate the biography dataset. The prompt used for generating the dataset is as follows:

Please generate biographies for 100 individuals in the following format:

[name] was born on [month] [date], [year]. He/She spent his/her early years in [city], [country]. He/She received mentorship and guidance from faculty members at [university]. He/She completed his/her education with a focus on [major]. He/She had a professional role at [company].

Below are 2 sentences that have been written to standard, which you can refer to for the format:

“Adrian Wallace was born on March 15, 1975. He spent his early years in Toronto, Canada. He received mentorship and guidance from faculty members at McGill University. He completed his education with a focus on Artificial Intelligence. He had a professional role at the Pacific Tech Innovation Hub.”

“Sofia Ramirez was born on June 22, 1982. She spent her early years in Buenos Aires, Argentina. She received mentorship and guidance from faculty members at the University of São Paulo. She completed her education with a focus on Environmental Science and Policy. She had a professional role at the Green Earth Strategies Corp.”

Please consider the following requirements when generating the biographies:

1. Avoid using existing people’s names, especially those that are well-known.

2. Use real existing places, dates, companies, and majors in the generated biographies.

3. The names should not be repeated.

4. The birth places, universities, and companies should not be strongly associated (e.g., in a same city) for each individual.

To evaluate the model’s performance, we use the following question answering task, which asks about the 5 pieces of information in each individual’s biography:

Question: When was Liam Thompson born?

Answer: January 5, 1990.

Question: Where was Liam Thompson born?

Answer: Melbourne, Australia.

Question: Which university did Liam Thompson graduate from?

Answer: Sorbonne University.

Question: What did Liam Thompson study?

Answer: Biomedical Engineering.

Question: Where did Liam Thompson work?

Answer: the British Museum.
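The question-and-answer format above can be generated mechanically from a biography record. The following is a minimal sketch, not the authors' code: the record layout, the field names, and the helper `build_qa_pairs` are hypothetical and introduced only for illustration.

```python
# Question templates copied from the examples above; the dictionary keys
# are hypothetical field names chosen for this sketch.
QUESTION_TEMPLATES = {
    "birth_date": "When was {name} born?",
    "birth_place": "Where was {name} born?",
    "university": "Which university did {name} graduate from?",
    "major": "What did {name} study?",
    "company": "Where did {name} work?",
}


def build_qa_pairs(record):
    """Return one (question, answer) pair per information field."""
    name = record["name"]
    return [
        (f"Question: {template.format(name=name)}", f"Answer: {record[field]}.")
        for field, template in QUESTION_TEMPLATES.items()
    ]


liam = {
    "name": "Liam Thompson",
    "birth_date": "January 5, 1990",
    "birth_place": "Melbourne, Australia",
    "university": "Sorbonne University",
    "major": "Biomedical Engineering",
    "company": "the British Museum",
}
qa_pairs = build_qa_pairs(liam)
for question, answer in qa_pairs:
    print(question)
    print(answer)
```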

We use a 5-shot prompt in the question answering task, where we provide the model with 5 randomly selected question-and-answer examples of the same format but for well-known individuals. This ensures that the model understands the task and the format of the answers. We evaluate the model’s performance using exact match (EM) accuracy and normalized word-level F1 score, as used in the SQuAD question answering benchmark (Rajpurkar et al. 2016).
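For reference, the two metrics can be sketched as follows. This follows the commonly used SQuAD-style answer normalization (lowercasing, removing punctuation and English articles, collapsing whitespace); it is an illustrative reimplementation, and the function names are our own rather than taken from the paper's code.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Word-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The British Museum", "the British Museum.")
partial_f1 = f1_score("Melbourne", "Melbourne, Australia.")
```

A prediction that recovers only part of the answer, such as the city without the country, scores 0 on EM but receives partial credit on F1.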

Appendix B Implementation details and discussion

Figure 5: A graph of the function $p(r)=\alpha(1-e^{-\beta r})$ used to generate dropout probabilities based on attention difference.
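The shape of this function can be checked numerically. The sketch below evaluates p(r) and applies it as a per-token dropout probability. It assumes that each token already has a non-negative score r derived from the attention difference as defined in the main text; the scores in the example are invented, and the helper names are hypothetical.

```python
import math
import random


def dropout_probability(r, alpha, beta):
    """p(r) = alpha * (1 - exp(-beta * r)): 0 at r = 0, saturating at alpha."""
    return alpha * (1.0 - math.exp(-beta * r))


def token_dropout(tokens, r_values, alpha, beta, rng):
    """Return one augmented copy in which each token is dropped with probability p(r)."""
    kept = []
    for token, r in zip(tokens, r_values):
        if rng.random() >= dropout_probability(r, alpha, beta):
            kept.append(token)
    return kept


rng = random.Random(0)
tokens = ["Liam", "Thompson", "had", "a", "professional", "role"]
r_values = [0, 0, 30, 30, 10, 10]  # invented scores, for illustration only
augmented = token_dropout(tokens, r_values, alpha=0.7, beta=0.05, rng=rng)
print(augmented)
```

In this form α bounds the dropout probability from above, and β controls how quickly the probability approaches that bound as r grows. Tokens with r = 0 are never dropped.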

Definition of Distance

“Distance” in Section 3.2 and Figure 1 refers to the number of tokens between the tail entity and the head entity in text, for example, between the “company” field and the person’s name in the biography. The distance is calculated by subtracting the average position of the head-entity tokens from the average position of the tail-entity tokens. In Figure 1, the x-axis marks the information fields in order of increasing distance. The figure shows that the further away the relevant information is from the name tokens, the lower the model’s accuracy in learning this information.
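As a small worked example of this definition, the sketch below computes the distance from token positions. The positions are invented for illustration, and the function name is hypothetical.

```python
def entity_distance(head_positions, tail_positions):
    """Mean token position of the tail entity minus that of the head entity."""
    head_mean = sum(head_positions) / len(head_positions)
    tail_mean = sum(tail_positions) / len(tail_positions)
    return tail_mean - head_mean


# Name tokens at positions 0-1, company tokens at positions 40-42 (made-up values).
distance = entity_distance([0, 1], [40, 41, 42])
print(distance)
```

With this sign convention the value is positive when the tail entity follows the head entity, as in the original biography format.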

Non-obvious but Important Clues

In the synthetic biography text, the head entity is an important clue that is obvious to humans but not to language models, as seen in their low attention to the head entity (shown in Figure 2). Language models likely fail to notice the head entity because of the long distance and the presence of other distractions. We noticed that even large language models pay only limited attention to these non-obvious but important clues, which helps explain the performance improvement observed for large language models such as LLaMA 3 70B on both synthetic and real-world datasets.

Generalization

The results in Table 2 and Table 3 demonstrate that our approach generalizes across different forms of the dataset. We further show that our method is not confined to certain positions (for example, the end of the sentence) or certain tokens (for example, prepositions).

While we visualize attention at preposition tokens to better illustrate the idea, we show that using attention differences on all tokens can help enhance knowledge learning (on Wikipedia data) without depending on prepositions. For the biography data, when using attention differences on information fields instead of just on prepositions, the model achieved similar performance improvements with our method on LLaMA 2 7B, as shown in Table 4.

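| Model | Group | Method | University EM | University F1 | Company EM | Company F1 |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 2 7B | Baselines | Plain finetuning | 0.21 | 0.53 | 0.10 | 0.15 |
| LLaMA 2 7B | Baselines | Random token-dropout | 0.37 | 0.66 | 0.34 | 0.39 |
| LLaMA 2 7B | Baselines | Token-dropout by attention | 0.37 | 0.64 | 0.21 | 0.26 |
| LLaMA 2 7B | Ours | Dropout by all tokens attention diff | 0.53 | 0.75 | 0.51 | 0.59 |
| LLaMA 2 7B | Ours | Dropout by preposition tokens attention diff | 0.67 | 0.84 | 0.56 | 0.59 |

Table 4: QA performance after continual pretraining on the biography corpus. Data augmentation based on the attention difference over all tokens also significantly outperforms other data augmentation methods.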

Computational Costs

Since our approach uses a large language model and increases the size of the data during data augmentation, the cost of model training may increase accordingly. However, we found that the cost increase is often mild in practice.

Firstly, when training a smaller language model, the larger language model is only used for inference and accounts for a small fraction of the computational cost compared to training. Secondly, the increase in training cost is manageable because the model requires significantly fewer epochs to converge with data augmentation. For example, LLaMA 2 70B converges at the 20th epoch without data augmentation, but already at the 6th epoch with data augmentation, as shown in Table 5.

| Model | Group | Setting | Epoch | Steps | EM | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 2 70B | Plain finetuning | No Augmentation | 20 | 2000 | 0.58 | 0.62 |
| LLaMA 2 70B | Ours | With Augmentation | 3 | 3300 | 0.60 | 0.72 |
| LLaMA 2 70B | Ours | With Augmentation | 6 | 6600 | 0.86 | 0.89 |

Table 5: Computational Costs and QA performance after continual pretraining on the biography corpus. The increase in actual training cost is manageable because the model requires significantly fewer epochs to converge with data augmentation.
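The step counts in Table 5 can be put on a common scale with a little arithmetic. The sketch below uses only the numbers reported in the table and assumes that one training step costs roughly the same in every run (same batch size and sequence length), which the table itself does not state.

```python
# Values copied from Table 5; the dictionary keys are labels chosen for this sketch.
runs = {
    "no_augmentation": {"epochs": 20, "steps": 2000, "em": 0.58},
    "augmentation_3_epochs": {"epochs": 3, "steps": 3300, "em": 0.60},
    "augmentation_6_epochs": {"epochs": 6, "steps": 6600, "em": 0.86},
}

baseline_steps = runs["no_augmentation"]["steps"]
summary = {}
for name, run in runs.items():
    summary[name] = {
        "steps_per_epoch": run["steps"] / run["epochs"],
        "relative_steps": run["steps"] / baseline_steps,
    }
    print(name, summary[name], "EM:", run["em"])
```

Under that assumption, the augmented run matches the EM of the plain run after about 1.65 times as many steps, and reaches its final EM after about 3.3 times as many steps.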

Appendix C More results

Results for other models

| Model | Group | Method | α | β | University EM | University F1 | Company EM | Company F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma 2B | Baselines | Plain finetuning | - | - | 0.04 | 0.44 | 0.04 | 0.05 |
| Gemma 2B | Baselines | Random token-dropout | 0.6 | - | 0.08 | 0.47 | 0.08 | 0.10 |
| Gemma 2B | Baselines | Token-dropout by attention | 0.7 | 0.05 | 0.13 | 0.51 | 0.07 | 0.10 |
| Gemma 2B | Baselines | Token-dropout by distance | 0.7 | 0.1 | 0.17 | 0.53 | 0.19 | 0.24 |
| Gemma 2B | Ours | Dropout by all tokens attention diff | 0.6 | 0.05 | 0.16 | 0.54 | 0.15 | 0.17 |
| Gemma 2B | Ours | Dropout by preposition attention diff | 0.6 | 0.05 | 0.26 | 0.57 | 0.21 | 0.24 |
| LLaMA 2 7B | Baselines | Plain finetuning | - | - | 0.21 | 0.53 | 0.10 | 0.15 |
| LLaMA 2 7B | Baselines | Random token-dropout | 0.7 | - | 0.37 | 0.66 | 0.34 | 0.39 |
| LLaMA 2 7B | Baselines | Token-dropout by attention | 0.7 | 0.05 | 0.37 | 0.64 | 0.21 | 0.26 |
| LLaMA 2 7B | Baselines | Token-dropout by distance | 0.7 | 0.1 | 0.44 | 0.69 | 0.49 | 0.59 |
| LLaMA 2 7B | Ours | Dropout by all tokens attention diff | 0.6 | 0.05 | 0.53 | 0.75 | 0.51 | 0.59 |
| LLaMA 2 7B | Ours | Dropout by preposition attention diff | 0.7 | 0.05 | 0.67 | 0.84 | 0.56 | 0.59 |
| LLaMA 2 70B | Baselines | Plain finetuning | - | - | 0.48 | 0.71 | 0.58 | 0.62 |
| LLaMA 2 70B | Baselines | Random token-dropout | 0.7 | - | 0.71 | 0.86 | 0.68 | 0.72 |
| LLaMA 2 70B | Baselines | Token-dropout by attention | 0.7 | 0.05 | 0.70 | 0.83 | 0.77 | 0.80 |
| LLaMA 2 70B | Baselines | Token-dropout by distance | 0.7 | 0.1 | 0.79 | 0.89 | 0.87 | 0.95 |
| LLaMA 2 70B | Ours | Dropout by all tokens attention diff | 0.6 | 0.05 | 0.75 | 0.88 | 0.71 | 0.80 |
| LLaMA 2 70B | Ours | Dropout by preposition attention diff | 0.7 | 0.05 | 0.85 | 0.92 | 0.89 | 0.92 |

Table 6: QA performance after continual pretraining on the biography corpus. Data augmentation based on attention difference significantly outperforms other data augmentation methods, for both small and large models.

We ran experiments with LLaMA 2 and Gemma models on both the synthetic dataset and the real-world dataset. The results in Table 6 and Table 8 further support our approach and show that the method generalizes across different models and data. We introduced two additional methods (distance-based dropout and dropout by all-token attention difference) to illustrate the effectiveness and generality of our approach.

In the original synthetic biography dataset, the head entity (the person’s name) appears at the beginning of the sentence. For more general scenarios, we also consider a version of the biography dataset, where the head entity (the person’s name) appears in a random sentence within the paragraph. For example:

He was born on January 5, 1990. He spent his early years in Melbourne, Australia. Liam Thompson received mentorship and guidance from faculty members at Sorbonne University. He completed his education with a focus on Biomedical Engineering. He had a professional role at the British Museum.
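One way to construct this variant from a structured record is sketched below. The record layout and the helper `build_biography` are hypothetical. The sentence templates are copied from the generation prompt in Appendix A, and the pronoun fields are an assumption made for this sketch.

```python
import random


def build_biography(record, name_sentence_index):
    """Render a biography whose full name appears only in the chosen sentence."""
    possessive = record["possessive"]  # "his" or "her"
    bodies = [
        f"was born on {record['birth_date']}.",
        f"spent {possessive} early years in {record['birth_place']}.",
        f"received mentorship and guidance from faculty members at {record['university']}.",
        f"completed {possessive} education with a focus on {record['major']}.",
        f"had a professional role at {record['company']}.",
    ]
    sentences = []
    for index, body in enumerate(bodies):
        subject = record["name"] if index == name_sentence_index else record["pronoun"]
        sentences.append(f"{subject} {body}")
    return " ".join(sentences)


liam = {
    "name": "Liam Thompson", "pronoun": "He", "possessive": "his",
    "birth_date": "January 5, 1990", "birth_place": "Melbourne, Australia",
    "university": "Sorbonne University", "major": "Biomedical Engineering",
    "company": "the British Museum",
}
fixed_example = build_biography(liam, name_sentence_index=2)
random_example = build_biography(liam, random.Random(0).randrange(5))
print(fixed_example)
```

Setting `name_sentence_index=0` recovers the original format, in which the name opens the paragraph.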

We also implement a distance-based augmentation baseline that drops tokens based on the distance between the head entity and the tail entity.

We test our method on this dataset and find that it also outperforms the distance-based augmentation method. This shows that our method can adaptively enhance clues at arbitrary positions based on their semantic significance, rather than simply enhancing clues that are far away. The results for LLaMA 2 are listed in Table 7:

| Model | Group | Method | University EM | University F1 | Company EM | Company F1 |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 2 7B | Baselines | Plain finetuning | 0.14 | 0.47 | 0.27 | 0.29 |
| LLaMA 2 7B | Baselines | Random token-dropout | 0.21 | 0.55 | 0.29 | 0.31 |
| LLaMA 2 7B | Baselines | Token-dropout by attention | 0.16 | 0.24 | 0.30 | 0.58 |
| LLaMA 2 7B | Baselines | Token-dropout by distance | 0.18 | 0.54 | 0.41 | 0.45 |
| LLaMA 2 7B | Ours | Dropout by all tokens attention diff | 0.34 | 0.64 | 0.62 | 0.66 |

Table 7: QA performance after continual pretraining on the biography corpus where people’s names appear at random positions. Data augmentation based on attention difference significantly outperforms other data augmentation methods, and the method that drops tokens based only on distance does not perform well.

| Model | Group | Method | α | β | EM | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma 2B | Baselines | Plain finetuning | - | - | 0.105 | 0.186 |
| Gemma 2B | Baselines | Random token-dropout | 0.7 | - | 0.101 | 0.189 |
| Gemma 2B | Baselines | Token-dropout by attention | 0.7 | 0.05 | 0.098 | 0.193 |
| Gemma 2B | Ours | Dropout by preposition attention diff | 0.7 | 0.005 | 0.115 | 0.204 |
| LLaMA 2 7B | Baselines | Plain finetuning | - | - | 0.138 | 0.231 |
| LLaMA 2 7B | Baselines | Random token-dropout | 0.7 | - | 0.126 | 0.217 |
| LLaMA 2 7B | Baselines | Token-dropout by attention | 0.7 | 0.05 | 0.123 | 0.217 |
| LLaMA 2 7B | Ours | Dropout by preposition attention diff | 0.7 | 0.03 | 0.144 | 0.268 |
| LLaMA 2 70B | Baselines | Plain finetuning | - | - | 0.179 | 0.282 |
| LLaMA 2 70B | Baselines | Random token-dropout | 0.7 | - | 0.187 | 0.307 |
| LLaMA 2 70B | Baselines | Token-dropout by attention | 0.7 | 0.05 | 0.204 | 0.321 |
| LLaMA 2 70B | Ours | Dropout by preposition attention diff | 0.7 | 0.005 | 0.219 | 0.331 |

Table 8: QA performance after continual pretraining on the Wikipedia corpus. Data augmentation based on attention difference outperforms other data augmentation methods.

Attention visualization

Following the principle of simplicity, we show that a simple averaging of attention weights across all layers and heads already demonstrates the main idea of the paper and produces substantial improvements. We believe that analysis of individual layers and heads may lead to more specific and better results, and we leave this for future work.
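The averaging step can be sketched as follows. This is a dependency-free illustration on nested lists, assuming the attention weights are available as an array of shape (layers, heads, sequence, sequence). The function name and the toy values are our own, and a practical implementation would more likely use tensor operations.

```python
def average_attention(attention):
    """Average attention[layer][head][query][key] over all layers and heads."""
    num_layers = len(attention)
    num_heads = len(attention[0])
    seq_len = len(attention[0][0])
    totals = [[0.0] * seq_len for _ in range(seq_len)]
    for layer in attention:
        for head in layer:
            for q in range(seq_len):
                for k in range(seq_len):
                    totals[q][k] += head[q][k]
    count = num_layers * num_heads
    return [[value / count for value in row] for row in totals]


# Toy causal attention: 2 layers x 2 heads over 3 tokens, each row sums to 1.
head_a = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
head_b = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.6, 0.2, 0.2]]
toy_attention = [[head_a, head_b], [head_b, head_a]]
averaged = average_attention(toy_attention)
print(averaged[-1])  # averaged weights from the last token to every token
```

Because every head's rows sum to one, the averaged rows do as well, so the result can still be read as a distribution over attended tokens.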

Figure 6: Visualization of attention weights for different models, before and after finetuning. Panels: (a) Gemma 2 2B, original; (b) Gemma 2 2B, after finetuning; (c) Gemma 2 9B, original; (d) Gemma 2 9B, after finetuning; (e) LLaMA 3 8B, original; (f) LLaMA 3 8B, after finetuning; (g) LLaMA 3 70B, original; (h) LLaMA 3 70B, after finetuning. Tokens in a sentence are ranked by attention weight from largest to smallest. Each bar in the graph shows the composition of the i-th ranked token across the 100 biographies. “⟨…⟩” denotes tokens belonging to the information fields, and all others are individual tokens.
Figure 7: Visualization of tokens receiving the highest additional attention weights from the trained model compared to the original model. Panels: (a) Gemma 2 2B, trained/original; (b) Gemma 2 9B, trained/original; (c) LLaMA 3 8B, trained/original; (d) LLaMA 3 70B, trained/original. For example, the trained/original graph visualizes the distribution of the top 10 tokens with the largest attention_weight(trained) - attention_weight(original) values.
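The selection described in the Figure 7 caption can be sketched as a ranking by attention gain. The function name and the attention values below are invented for illustration. In practice the weights would come from the original and finetuned models evaluated on the same sentence.

```python
def top_attention_gains(tokens, original_weights, trained_weights, k=10):
    """Return the k tokens with the largest attention_weight(trained) - attention_weight(original)."""
    gains = [trained - original
             for original, trained in zip(original_weights, trained_weights)]
    ranked = sorted(range(len(tokens)), key=lambda i: gains[i], reverse=True)
    return [(tokens[i], gains[i]) for i in ranked[:k]]


tokens = ["Liam", "Thompson", "had", "a", "role", "at"]
original = [0.10, 0.05, 0.30, 0.25, 0.20, 0.10]  # made-up weights, sum to 1
trained = [0.35, 0.25, 0.10, 0.10, 0.10, 0.10]   # made-up weights, sum to 1
top_tokens = top_attention_gains(tokens, original, trained, k=2)
print(top_tokens)
```

In this toy example the two name tokens gain the most attention after training, which is the kind of shift the figure is meant to expose.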