Search | arXiv e-print repository

First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Authors: Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui

Abstract: Multi-step reasoning is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process. Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, as LMs progress clos… ▽ More Multi-step reasoning is widely adopted in the community to explore the better performance of language models (LMs). We report on the systematic strategy that LMs use in this process. Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely, as LMs progress closer to the final answer, their reliance on heuristics decreases. This suggests that LMs track only a limited number of future steps and dynamically combine heuristic strategies with logical ones in tasks involving multi-step reasoning. △ Less

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.12402 [pdf, other]

Flee the Flaw: Annotating the Underlying Logic of Fallacious Arguments Through Templates and Slot-filling

Authors: Irfan Robbani, Paul Reisert, Naoya Inoue, Surawat Pothong, Camélia Guerraoui, Wenzhi Wang, Shoichi Naito, Jungmin Choi, Kentaro Inui

Abstract: Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy's implicit logic. Using our templates, we conduct an annotation study on top of 400 fallacious arguments taken from… ▽ More Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy's implicit logic. Using our templates, we conduct an annotation study on top of 400 fallacious arguments taken from LOGIC dataset and achieve a high agreement score (Krippendorf's alpha of 0.54) and reasonable coverage (0.83). Finally, we conduct an experiment for detecting the structure of fallacies and discover that state-of-the-art language models struggle with detecting fallacy templates (0.47 accuracy). To facilitate research on fallacies, we make our dataset and guidelines publicly available. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.06032 [pdf, other]

The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models

Authors: Ryosuke Takahashi, Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

Abstract: Language models (LMs) encode world knowledge in their internal parameters through training. However, LMs may learn personal and confidential information from the training data, leading to privacy concerns such as data leakage. Therefore, research on knowledge deletion from LMs is essential. This study focuses on the knowledge stored in LMs and analyzes the relationship between the side effects of… ▽ More Language models (LMs) encode world knowledge in their internal parameters through training. However, LMs may learn personal and confidential information from the training data, leading to privacy concerns such as data leakage. Therefore, research on knowledge deletion from LMs is essential. This study focuses on the knowledge stored in LMs and analyzes the relationship between the side effects of knowledge deletion and the entities related to the knowledge. Our findings reveal that deleting knowledge related to popular entities can have catastrophic side effects. Furthermore, this research is the first to analyze knowledge deletion in models trained on synthetic knowledge graphs, indicating a new direction for controlled experiments. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2405.04818 [pdf, other]

ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Authors: Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Abstract: Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanation… ▽ More Evaluating free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to gain insights into how LLMs evaluate explanations. We observed that replacing one of the human ratings sometimes maintained, but more often lowered the inter-annotator agreement across different settings and quality aspects, suggesting that their judgments are not always consistent with human raters. We further quantified this difference by comparing the correlation between LLM-generated ratings with majority-voted human ratings across different quality aspects. With the best system, Spearman's rank correlation ranged between 0.53 to 0.95, averaging 0.72 across aspects, indicating moderately high but imperfect alignment. Finally, we considered the alternative of using an LLM as an additional rater when human raters are scarce, and measured the correlation between majority-voted labels with a limited human pool and LLMs as an additional rater, compared to the original gold labels. While GPT-4 improved the outcome when there were only two human raters, in all other observed cases, LLMs were neutral to detrimental when there were three or more human raters. We publicly release the dataset to support future improvements in LLM-in-the-loop evaluation here: https://github.com/a-brassard/ACORN. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: 18 pages, 7 figures, under review. Data available here: https://github.com/a-brassard/ACORN

arXiv:2404.11315 [pdf, other]

To Drop or Not to Drop? Predicting Argument Ellipsis Judgments: A Case Study in Japanese

Authors: Yukiko Ishizuki, Tatsuki Kuribayashi, Yuichiroh Matsubayashi, Ryohei Sasano, Kentaro Inui

Abstract: Speakers sometimes omit certain arguments of a predicate in a sentence; such omission is especially frequent in pro-drop languages. This study addresses a question about ellipsis -- what can explain the native speakers' ellipsis decisions? -- motivated by the interest in human discourse processing and writing assistance for this choice. To this end, we first collect large-scale human annotations o… ▽ More Speakers sometimes omit certain arguments of a predicate in a sentence; such omission is especially frequent in pro-drop languages. This study addresses a question about ellipsis -- what can explain the native speakers' ellipsis decisions? -- motivated by the interest in human discourse processing and writing assistance for this choice. To this end, we first collect large-scale human annotations of whether and why a particular argument should be omitted across over 2,000 data points in the balanced corpus of Japanese, a prototypical pro-drop language. The data indicate that native speakers overall share common criteria for such judgments and further clarify their quantitative characteristics, e.g., the distribution of related linguistic factors in the balanced corpus. Furthermore, the performance of the language model-based argument ellipsis judgment model is examined, and the gap between the systems' prediction and human judgments in specific linguistic aspects is revealed. We hope our fundamental resource encourages further studies on natural human ellipsis judgment. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: 13 pages; accepted by LREC-COLING 2024

arXiv:2403.12500 [pdf, other]

A Large Collection of Model-generated Contradictory Responses for Consistency-aware Dialogue Systems

Authors: Shiki Sato, Reina Akama, Jun Suzuki, Kentaro Inui

Abstract: Mitigating the generation of contradictory responses poses a substantial challenge in dialogue response generation. The quality and quantity of available contradictory response data play a vital role in suppressing these contradictions, offering two significant benefits. First, having access to large contradiction data enables a comprehensive examination of their characteristics. Second, data-driv… ▽ More Mitigating the generation of contradictory responses poses a substantial challenge in dialogue response generation. The quality and quantity of available contradictory response data play a vital role in suppressing these contradictions, offering two significant benefits. First, having access to large contradiction data enables a comprehensive examination of their characteristics. Second, data-driven methods to mitigate contradictions may be enhanced with large-scale contradiction data for training. Nevertheless, no attempt has been made to build an extensive collection of model-generated contradictory responses. In this paper, we build a large dataset of response generation models' contradictions for the first time. Then, we acquire valuable insights into the characteristics of model-generated contradictions through an extensive analysis of the collected responses. Lastly, we also demonstrate how this dataset substantially enhances the performance of data-driven contradiction suppression methods. △ Less

Submitted 19 March, 2024; originally announced March 2024.

Comments: 16 pages

arXiv:2403.10381 [pdf, other]

Monotonic Representation of Numeric Properties in Language Models

Authors: Benjamin Heinzerling, Kentaro Inui

Abstract: Language models (LMs) can express factual knowledge involving numeric properties such as Karl Popper was born in 1902. However, how this information is encoded in the model's internal representations is not understood well. Here, we introduce a simple method for finding and editing representations of numeric properties such as an entity's birth year. Empirically, we find low-dimensional subspaces… ▽ More Language models (LMs) can express factual knowledge involving numeric properties such as Karl Popper was born in 1902. However, how this information is encoded in the model's internal representations is not understood well. Here, we introduce a simple method for finding and editing representations of numeric properties such as an entity's birth year. Empirically, we find low-dimensional subspaces that encode numeric properties monotonically, in an interpretable and editable fashion. When editing representations along directions in these subspaces, LM output changes accordingly. For example, by patching activations along a "birthyear" direction we can make the LM express an increasingly late birthyear: Karl Popper was born in 1929, Karl Popper was born in 1957, Karl Popper was born in 1968. Property-encoding directions exist across several numeric properties in all models under consideration, suggesting the possibility that monotonic representation of numeric properties consistently emerges during LM pretraining. Code: https://github.com/bheinzerling/numeric-property-repr △ Less

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.03396 [pdf, other]

Japanese-English Sentence Translation Exercises Dataset for Automatic Grading

Authors: Naoki Miura, Hiroaki Funayama, Seiya Kikuchi, Yuichiroh Matsubayashi, Yuya Iwase, Kentaro Inui

Abstract: This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), that have been used in the early stage of L2 language learning. We formalize the task as grading student responses for each rubric criterion pre-specified by the educators. We then create a dataset for STE between Japanese and English including 21 questions, along with a total of 3, 498 student responses… ▽ More This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), that have been used in the early stage of L2 language learning. We formalize the task as grading student responses for each rubric criterion pre-specified by the educators. We then create a dataset for STE between Japanese and English including 21 questions, along with a total of 3, 498 student responses (167 on average). The answer responses were collected from students and crowd workers. Using this dataset, we demonstrate the performance of baselines including finetuned BERT and GPT models with few-shot in-context learning. Experimental results show that the baseline model with finetuned BERT was able to classify correct responses with approximately 90% in F1, but only less than 80% for incorrect responses. Furthermore, the GPT models with few-shot learning show poorer results than finetuned BERT, indicating that our newly proposed task presents a challenging issue, even for the stateof-the-art large language models. △ Less

Submitted 5 March, 2024; originally announced March 2024.

Comments: 9 pages

arXiv:2402.14411 [pdf, other]

J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema

Authors: Kosuke Matsuzaki, Masaya Taniguchi, Kentaro Inui, Keisuke Sakaguchi

Abstract: We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language's agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 infl… ▽ More We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language's agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] +suru (do-PRS)). Morphologically, this form is equivalent to the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We release J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications. △ Less

Submitted 22 February, 2024; originally announced February 2024.

Comments: 14 pages, 4 figures

arXiv:2310.17121 [pdf, other]

Test-time Augmentation for Factual Probing

Authors: Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

Abstract: Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen… ▽ More Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: 12 pages, 4 figures, accepted to EMNLP 2023 Findings (short paper)

arXiv:2310.15921 [pdf, other]

Contrastive Learning-based Sentence Encoders Implicitly Weight Informative Words

Authors: Hiroto Kurita, Goro Kobayashi, Sho Yokoi, Kentaro Inui

Abstract: The performance of sentence encoders can be significantly improved through the simple practice of fine-tuning using contrastive loss. A natural question arises: what characteristics do models acquire during contrastive learning? This paper theoretically and experimentally shows that contrastive-based sentence encoders implicitly weight words based on information-theoretic quantities; that is, more… ▽ More The performance of sentence encoders can be significantly improved through the simple practice of fine-tuning using contrastive loss. A natural question arises: what characteristics do models acquire during contrastive learning? This paper theoretically and experimentally shows that contrastive-based sentence encoders implicitly weight words based on information-theoretic quantities; that is, more informative words receive greater weight, while others receive less. The theory states that, in the lower bound of the optimal value of the contrastive learning objective, the norm of word embedding reflects the information gain associated with the distribution of surrounding words. We also conduct comprehensive experiments using various models, multiple datasets, two methods to measure the implicit weighting of models (Integrated Gradients and SHAP), and two information-theoretic quantities (information gain and self-information). The results provide empirical evidence that contrastive fine-tuning emphasizes informative words. △ Less

Submitted 24 October, 2023; originally announced October 2023.

Comments: 16 pages, 6 figures, accepted to EMNLP 2023 Findings (short paper)

arXiv:2308.01044 [pdf, other]

doi 10.18653/v1/2022.eval4nlp-1.9

Chat Translation Error Detection for Assisting Cross-lingual Communications

Authors: Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Ryoko Tokuhisa, Ana Brassard, Kentaro Inui

Abstract: In this paper, we describe the development of a communication support system that detects erroneous translations to facilitate crosslingual communications due to the limitations of current machine chat translation methods. We trained an error detector as the baseline of the system and constructed a new Japanese-English bilingual chat corpus, BPersona-chat, which comprises multiturn colloquial chat… ▽ More In this paper, we describe the development of a communication support system that detects erroneous translations to facilitate crosslingual communications due to the limitations of current machine chat translation methods. We trained an error detector as the baseline of the system and constructed a new Japanese-English bilingual chat corpus, BPersona-chat, which comprises multiturn colloquial chats augmented with crowdsourced quality ratings. The error detector can serve as an encouraging foundation for more advanced erroneous translation detection systems. △ Less

Submitted 2 August, 2023; originally announced August 2023.

Journal ref: Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems, pages 88-95, November 2022, Online. Association for Computational Linguistics

arXiv:2307.15341 [pdf, other]

Teach Me How to Improve My Argumentation Skills: A Survey on Feedback in Argumentation

Authors: Camélia Guerraoui, Paul Reisert, Naoya Inoue, Farjana Sultana Mim, Shoichi Naito, Jungmin Choi, Irfan Robbani, Wenzhi Wang, Kentaro Inui

Abstract: The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are useful for evaluating the quality of an argument, they oftentimes cannot explain why a particular argument is considered poor or not, which makes it difficult to p… ▽ More The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are useful for evaluating the quality of an argument, they oftentimes cannot explain why a particular argument is considered poor or not, which makes it difficult to provide constructive feedback to users to strengthen their critical thinking skills. In this survey, we aim to explore the different dimensions of feedback (Richness, Visualization, Interactivity, and Personalization) provided by the current computational models for argumentation, and the possibility of enhancing the power of explanations of such models, ultimately helping learners improve their critical thinking skills. △ Less

Submitted 28 July, 2023; originally announced July 2023.

Comments: 14 pages, 4 figures

arXiv:2305.18294 [pdf, other]

Transformer Language Models Handle Word Frequency in Prediction Head

Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Abstract: Prediction head is a crucial component of Transformer language models. Despite its direct impact on prediction, this component has often been overlooked in analyzing Transformers. In this study, we investigate the inner workings of the prediction head, specifically focusing on bias parameters. Our experiments with BERT and GPT-2 models reveal that the biases in their word prediction heads play a s… ▽ More Prediction head is a crucial component of Transformer language models. Despite its direct impact on prediction, this component has often been overlooked in analyzing Transformers. In this study, we investigate the inner workings of the prediction head, specifically focusing on bias parameters. Our experiments with BERT and GPT-2 models reveal that the biases in their word prediction heads play a significant role in the models' ability to reflect word frequency in a corpus, aligning with the logit adjustment method commonly used in long-tailed learning. We also quantify the effect of controlling the biases in practical auto-regressive text generation scenarios; under a particular setting, more diverse text can be generated without compromising text quality. △ Less

Submitted 29 May, 2023; originally announced May 2023.

Comments: 11 pages, 12 figures, accepted to ACL 2023 Findings (short paper)

arXiv:2303.14342 [pdf, other]

Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction

Authors: Steven Coyne, Keisuke Sakaguchi, Diana Galvan-Sosa, Michael Zock, Kentaro Inui

Abstract: GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks. However, there is a relative lack of detailed published analysis of their performance on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC b… ▽ More GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks. However, there is a relative lack of detailed published analysis of their performance on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shot and few-shot settings, analyzing intriguing or problematic outputs encountered with different prompt formats. We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting, with GPT-4 achieving a new high score on the JFLEG benchmark. Through human evaluation experiments, we compare the GPT models' corrections to source, human reference, and baseline GEC system sentences and observe differences in editing strategies and how they are scored by human raters. △ Less

Submitted 30 May, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

arXiv:2302.08148 [pdf, other]

Empirical Investigation of Neural Symbolic Reasoning Strategies

Authors: Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract: Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1,… ▽ More Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Comments: This paper is accepted as the findings at EACL 2023, and the earlier version (non-archival) of this work got the Best Paper Award in the Student Research Workshop of AACL 2022

arXiv:2302.07866 [pdf, other]

Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?

Authors: Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract: Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a ski… ▽ More Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps. △ Less

Submitted 15 February, 2023; originally announced February 2023.

Comments: accepted by EACL 2023

arXiv:2302.00456 [pdf, other]

Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps

Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Abstract: Transformers are ubiquitous in wide tasks. Interpreting their internals is a pivotal goal. Nevertheless, their particular components, feed-forward (FF) blocks, have typically been less analyzed despite their substantial parameter amounts. We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme. Our experiments wit… ▽ More Transformers are ubiquitous in wide tasks. Interpreting their internals is a pivotal goal. Nevertheless, their particular components, feed-forward (FF) blocks, have typically been less analyzed despite their substantial parameter amounts. We analyze the input contextualization effects of FF blocks by rendering them in the attention maps as a human-friendly visualization scheme. Our experiments with both masked- and causal-language models reveal that FF networks modify the input contextualization to emphasize specific types of linguistic compositions. In addition, FF and its surrounding components tend to cancel out each other's effects, suggesting potential redundancy in the processing of the Transformer layer. △ Less

Submitted 15 April, 2024; v1 submitted 1 February, 2023; originally announced February 2023.

Comments: ICLR 2024 Spotlight; 37 pages, 32 figures, 3 tables

arXiv:2301.06758 [pdf, other]

Tracing and Manipulating Intermediate Values in Neural Math Problem Solvers

Authors: Yuta Matsumoto, Benjamin Heinzerling, Masashi Yoshikawa, Kentaro Inui

Abstract: How language models process complex input that requires multiple steps of inference is not well understood. Previous research has shown that information about intermediate values of these inputs can be extracted from the activations of the models, but it is unclear where that information is encoded and whether that information is indeed used during inference. We introduce a method for analyzing ho… ▽ More How language models process complex input that requires multiple steps of inference is not well understood. Previous research has shown that information about intermediate values of these inputs can be extracted from the activations of the models, but it is unclear where that information is encoded and whether that information is indeed used during inference. We introduce a method for analyzing how a Transformer model processes these inputs by focusing on simple arithmetic problems and their intermediate values. To trace where information about intermediate values is encoded, we measure the correlation between intermediate values and the activations of the model using principal component analysis (PCA). Then, we perform a causal intervention by manipulating model weights. This intervention shows that the weights identified via tracing are not merely correlated with intermediate values, but causally related to model predictions. Our findings show that the model has a locality to certain intermediate values, and this is useful for enhancing the interpretability of the models. △ Less

Submitted 17 January, 2023; originally announced January 2023.

Comments: 5 pages, 4 figures, MathNLP

arXiv:2211.01432 [pdf, other]

Cross-stitching Text and Knowledge Graph Encoders for Distantly Supervised Relation Extraction

Authors: Qin Dai, Benjamin Heinzerling, Kentaro Inui

Abstract: Bi-encoder architectures for distantly-supervised relation extraction are designed to make use of the complementary information found in text and knowledge graphs (KG). However, current architectures suffer from two drawbacks. They either do not allow any sharing between the text encoder and the KG encoder at all, or, in case of models with KG-to-text attention, only share information in one direc… ▽ More Bi-encoder architectures for distantly-supervised relation extraction are designed to make use of the complementary information found in text and knowledge graphs (KG). However, current architectures suffer from two drawbacks. They either do not allow any sharing between the text encoder and the KG encoder at all, or, in case of models with KG-to-text attention, only share information in one direction. Here, we introduce cross-stitch bi-encoders, which allow full interaction between the text encoder and the KG encoder via a cross-stitch mechanism. The cross-stitch mechanism allows sharing and updating representations between the two encoders at any layer, with the amount of sharing being dynamically controlled via cross-attention-based gates. Experimental results on two relation extraction benchmarks from two different domains show that enabling full interaction between the two encoders yields strong improvements. △ Less

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2209.09746 [pdf, other]

Target-Guided Open-Domain Conversation Planning

Authors: Yosuke Kishinami, Reina Akama, Shiki Sato, Ryoko Tokuhisa, Jun Suzuki, Kentaro Inui

Abstract: Prior studies addressing target-oriented conversational tasks lack a crucial notion that has been intensively studied in the context of goal-oriented artificial intelligence agents, namely, planning. In this study, we propose the task of Target-Guided Open-Domain Conversation Planning (TGCP) task to evaluate whether neural conversational agents have goal-oriented conversation planning abilities. U… ▽ More Prior studies addressing target-oriented conversational tasks lack a crucial notion that has been intensively studied in the context of goal-oriented artificial intelligence agents, namely, planning. In this study, we propose the task of Target-Guided Open-Domain Conversation Planning (TGCP) task to evaluate whether neural conversational agents have goal-oriented conversation planning abilities. Using the TGCP task, we investigate the conversation planning abilities of existing retrieval models and recent strong generative models. The experimental results reveal the challenges facing current technology. △ Less

Submitted 20 September, 2022; originally announced September 2022.

Comments: 9 pages, Accepted to The 29th International Conference on Computational Linguistics (COLING 2022)

arXiv:2208.02578 [pdf, other]

N-best Response-based Analysis of Contradiction-awareness in Neural Response Generation Models

Authors: Shiki Sato, Reina Akama, Hiroki Ouchi, Ryoko Tokuhisa, Jun Suzuki, Kentaro Inui

Abstract: Avoiding the generation of responses that contradict the preceding context is a significant challenge in dialogue response generation. One feasible method is post-processing, such as filtering out contradicting responses from a resulting n-best response list. In this scenario, the quality of the n-best list considerably affects the occurrence of contradictions because the final response is chosen… ▽ More Avoiding the generation of responses that contradict the preceding context is a significant challenge in dialogue response generation. One feasible method is post-processing, such as filtering out contradicting responses from a resulting n-best response list. In this scenario, the quality of the n-best list considerably affects the occurrence of contradictions because the final response is chosen from this n-best list. This study quantitatively analyzes the contextual contradiction-awareness of neural response generation models using the consistency of the n-best lists. Particularly, we used polar questions as stimulus inputs for concise and quantitative analyses. Our tests illustrate the contradiction-awareness of recent neural response generation models and methodologies, followed by a discussion of their properties and limitations. △ Less

Submitted 4 August, 2022; originally announced August 2022.

Comments: 8 pages, Accepted to The 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2022)

arXiv:2207.13332 [pdf, other]

RealTime QA: What's the Answer Right Now?

Authors: Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, Kentaro Inui

Abstract: We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). REALTIME QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open-domain QA datasets and pursues instantaneous applicat… ▽ More We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). REALTIME QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open-domain QA datasets and pursues instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this paper presents real-time evaluation results over the past year. Our experimental results show that GPT-3 can often properly update its generation results, based on newly-retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that REALTIME QA will spur progress in instantaneous applications of question answering and beyond. △ Less

Submitted 28 February, 2024; v1 submitted 27 July, 2022; originally announced July 2022.

Comments: RealTime QA Website: https://realtimeqa.github.io/

arXiv:2206.08288 [pdf, other]

Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer Scoring

Authors: Hiroaki Funayama, Tasuku Sato, Yuichiroh Matsubayashi, Tomoya Mizumoto, Jun Suzuki, Kentaro Inui

Abstract: Short answer scoring (SAS) is the task of grading short text written by a learner. In recent years, deep-learning-based approaches have substantially improved the performance of SAS models, but how to guarantee high-quality predictions still remains a critical issue when applying such models to the education field. Towards guaranteeing high-quality predictions, we present the first study of explor… ▽ More Short answer scoring (SAS) is the task of grading short text written by a learner. In recent years, deep-learning-based approaches have substantially improved the performance of SAS models, but how to guarantee high-quality predictions still remains a critical issue when applying such models to the education field. Towards guaranteeing high-quality predictions, we present the first study of exploring the use of human-in-the-loop framework for minimizing the grading cost while guaranteeing the grading quality by allowing a SAS model to share the grading task with a human grader. Specifically, by introducing a confidence estimation method for indicating the reliability of the model predictions, one can guarantee the scoring quality by utilizing only predictions with high reliability for the scoring results and casting predictions with low reliability to human graders. In our experiments, we investigate the feasibility of the proposed framework using multiple confidence estimation methods and multiple SAS datasets. We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality. △ Less

Submitted 16 June, 2022; originally announced June 2022.

Comments: 12pages, To be published in proceedings of AIED2022

arXiv:2205.11833 [pdf, other]

Diverse Lottery Tickets Boost Ensemble from a Single Pretrained Model

Authors: Sosuke Kobayashi, Shun Kiyono, Jun Suzuki, Kentaro Inui

Abstract: Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has been not very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demon… ▽ More Ensembling is a popular method used to improve performance as a last resort. However, ensembling multiple models finetuned from a single pretrained model has been not very effective; this could be due to the lack of diversity among ensemble members. This paper proposes Multi-Ticket Ensemble, which finetunes different subnetworks of a single pretrained model and ensembles them. We empirically demonstrated that winning-ticket subnetworks produced more diverse predictions than dense networks, and their ensemble outperformed the standard ensemble on some tasks. △ Less

Submitted 24 May, 2022; originally announced May 2022.

Comments: Workshop on Challenges & Perspectives in Creating Large Language Models (BigScience) 2022

arXiv:2205.11484 [pdf, other]

Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond

Authors: Masato Mita, Keisuke Sakaguchi, Masato Hagiwara, Tomoya Mizumoto, Jun Suzuki, Kentaro Inui

Abstract: Natural language processing technology has rapidly improved automated grammatical error correction tasks, and the community begins to explore document-level revision as one of the next challenges. To go beyond sentence-level automated grammatical error correction to NLP-based document-level revision assistant, there are two major obstacles: (1) there are few public corpora with document-level revi… ▽ More Natural language processing technology has rapidly improved automated grammatical error correction tasks, and the community begins to explore document-level revision as one of the next challenges. To go beyond sentence-level automated grammatical error correction to NLP-based document-level revision assistant, there are two major obstacles: (1) there are few public corpora with document-level revisions being annotated by professional editors, and (2) it is not feasible to elicit all possible references and evaluate the quality of revision with such references because there are infinite possibilities of revision. This paper tackles these challenges. First, we introduce a new document-revision corpus, TETRA, where professional editors revised academic papers sampled from the ACL anthology which contain few trivial grammatical errors that enable us to focus more on document- and paragraph-level edits such as coherence and consistency. Second, we explore reference-less and interpretable methods for meta-evaluation that can detect quality improvements by document revision. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle. This promising result will encourage the community to further explore automated document revision models and metrics in future. △ Less

Submitted 23 May, 2022; originally announced May 2022.

Comments: 14 pages

arXiv:2205.11463 [pdf, other]

Context Limitations Make Neural Language Models More Human-Like

Authors: Tatsuki Kuribayashi, Yohei Oseki, Ana Brassard, Kentaro Inui

Abstract: Language models (LMs) have been used in cognitive modeling as well as engineering studies -- they compute information-theoretic complexity metrics that simulate humans' cognitive load during reading. This study highlights a limitation of modern neural LMs as the model of choice for this purpose: there is a discrepancy between their context access capacities and that of humans. Our results showed t… ▽ More Language models (LMs) have been used in cognitive modeling as well as engineering studies -- they compute information-theoretic complexity metrics that simulate humans' cognitive load during reading. This study highlights a limitation of modern neural LMs as the model of choice for this purpose: there is a discrepancy between their context access capacities and that of humans. Our results showed that constraining the LMs' context access improved their simulation of human reading behavior. We also showed that LM-human gaps in context access were associated with specific syntactic constructions; incorporating syntactic biases into LMs' context access might enhance their cognitive plausibility. △ Less

Submitted 1 November, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Accepted by EMNLP2022 (main long)

arXiv:2204.01512 [pdf, other]

LPAttack: A Feasible Annotation Scheme for Capturing Logic Pattern of Attacks in Arguments

Authors: Farjana Sultana Mim, Naoya Inoue, Shoichi Naito, Keshav Singh, Kentaro Inui

Abstract: In argumentative discourse, persuasion is often achieved by refuting or attacking others arguments. Attacking is not always straightforward and often comprise complex rhetorical moves such that arguers might agree with a logic of an argument while attacking another logic. Moreover, arguer might neither deny nor agree with any logics of an argument, instead ignore them and attack the main stance of… ▽ More In argumentative discourse, persuasion is often achieved by refuting or attacking others arguments. Attacking is not always straightforward and often comprise complex rhetorical moves such that arguers might agree with a logic of an argument while attacking another logic. Moreover, arguer might neither deny nor agree with any logics of an argument, instead ignore them and attack the main stance of the argument by providing new logics and presupposing that the new logics have more value or importance than the logics present in the attacked argument. However, no existing studies in the computational argumentation capture such complex rhetorical moves in attacks or the presuppositions or value judgements in them. In order to address this gap, we introduce LPAttack, a novel annotation scheme that captures the common modes and complex rhetorical moves in attacks along with the implicit presuppositions and value judgements in them. Our annotation study shows moderate inter-annotator agreement, indicating that human annotation for the proposed scheme is feasible. We publicly release our annotated corpus and the annotation guidelines. △ Less

Submitted 4 April, 2022; originally announced April 2022.

Comments: 14 pages, 8 figures

arXiv:2201.06777 [pdf, other]

COPA-SSE: Semi-structured Explanations for Commonsense Reasoning

Authors: Ana Brassard, Benjamin Heinzerling, Pride Kavumba, Kentaro Inui

Abstract: We present Semi-Structured Explanations for COPA (COPA-SSE), a new crowdsourced dataset of 9,747 semi-structured, English common sense explanations for Choice of Plausible Alternatives (COPA) questions. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts. This semi-structured format strikes a balance between the high… ▽ More We present Semi-Structured Explanations for COPA (COPA-SSE), a new crowdsourced dataset of 9,747 semi-structured, English common sense explanations for Choice of Plausible Alternatives (COPA) questions. The explanations are formatted as a set of triple-like common sense statements with ConceptNet relations but freely written concepts. This semi-structured format strikes a balance between the high quality but low coverage of structured data and the lower quality but high coverage of free-form crowdsourcing. Each explanation also includes a set of human-given quality ratings. With their familiar format, the explanations are geared towards commonsense reasoners operating on knowledge graphs and serve as a starting point for ongoing work on improving such systems. The dataset is available at https://github.com/a-brassard/copa-sse. △ Less

Submitted 11 May, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

Comments: 6 pages, 6 figures, LREC 2022. Data available at https://github.com/a-brassard/copa-sse

arXiv:2201.06674 [pdf, other]

TYPIC: A Corpus of Template-Based Diagnostic Comments on Argumentation

Authors: Shoichi Naito, Shintaro Sawada, Chihiro Nakagawa, Naoya Inoue, Kenshi Yamaguchi, Iori Shimizu, Farjana Sultana Mim, Keshav Singh, Kentaro Inui

Abstract: Providing feedback on the argumentation of the learner is essential for developing critical thinking skills, however, it requires a lot of time and effort. To mitigate the overload on teachers, we aim to automate a process of providing feedback, especially giving diagnostic comments which point out the weaknesses inherent in the argumentation. It is recommended to give specific diagnostic comments… ▽ More Providing feedback on the argumentation of the learner is essential for developing critical thinking skills, however, it requires a lot of time and effort. To mitigate the overload on teachers, we aim to automate a process of providing feedback, especially giving diagnostic comments which point out the weaknesses inherent in the argumentation. It is recommended to give specific diagnostic comments so that learners can recognize the diagnosis without misinterpretation. However, it is not obvious how the task of providing specific diagnostic comments should be formulated. We present a formulation of the task as template selection and slot filling to make an automatic evaluation easier and the behavior of the model more tractable. The key to the formulation is the possibility of creating a template set that is sufficient for practical use. In this paper, we define three criteria that a template set should satisfy: expressiveness, informativeness, and uniqueness, and verify the feasibility of creating a template set that satisfies these criteria as a first trial. We will show that it is feasible through an annotation study that converts diagnostic comments given in a text to a template format. The corpus used in the annotation study is publicly available. △ Less

Submitted 21 June, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

Comments: LREC2022. The dataset is available at https://github.com/cl-tohoku/TYPIC

arXiv:2110.13692 [pdf, other]

Annotating Implicit Reasoning in Arguments with Causal Links

Authors: Keshav Singh, Naoya Inoue, Farjana Sultana Mim, Shoichi Naitoh, Kentaro Inui

Abstract: Most of the existing work that focus on the identification of implicit knowledge in arguments generally represent implicit knowledge in the form of commonsense or factual knowledge. However, such knowledge is not sufficient to understand the implicit reasoning link between individual argumentative components (i.e., claim and premise). In this work, we focus on identifying the implicit knowledge in… ▽ More Most of the existing work that focus on the identification of implicit knowledge in arguments generally represent implicit knowledge in the form of commonsense or factual knowledge. However, such knowledge is not sufficient to understand the implicit reasoning link between individual argumentative components (i.e., claim and premise). In this work, we focus on identifying the implicit knowledge in the form of argumentation knowledge which can help in understanding the reasoning link in arguments. Being inspired by the Argument from Consequences scheme, we propose a semi-structured template to represent such argumentation knowledge that explicates the implicit reasoning in arguments via causality. We create a novel two-phase annotation process with simplified guidelines and show how to collect and filter high-quality implicit reasonings via crowdsourcing. We find substantial inter-annotator agreement for quality evaluation between experts, but find evidence that casts a few questions on the feasibility of collecting high-quality semi-structured implicit reasoning through our crowdsourcing process. We release our materials(i.e., crowdsourcing guidelines and collected implicit reasonings) to facilitate further research towards the structured representation of argumentation knowledge. △ Less

Submitted 26 October, 2021; originally announced October 2021.

Comments: Accepted to ArgKG:Workshop on Argumentation Knowledge Graphs (AKBC 2021)

arXiv:2109.13497 [pdf, other]

Instance-Based Neural Dependency Parsing

Authors: Hiroki Ouchi, Jun Suzuki, Sosuke Kobayashi, Sho Yokoi, Tatsuki Kuribayashi, Masashi Yoshikawa, Kentaro Inui

Abstract: Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to… ▽ More Interpretable rationales for model predictions are crucial in practical applications. We develop neural models that possess an interpretable inference process for dependency parsing. Our models adopt instance-based inference, where dependency edges are extracted and labeled by comparing them to edges in a training set. The training edges are explicitly used for the predictions; thus, it is easy to grasp the contribution of each edge to the predictions. Our experiments show that our instance-based models achieve competitive accuracy with standard neural models and have the reasonable plausibility of instance-based explanations. △ Less

Submitted 28 September, 2021; originally announced September 2021.

Comments: 15 pages, accepted to TACL 2021

arXiv:2109.07152 [pdf, other]

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Authors: Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Abstract: Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only composed of the multi-head attention; other components can also contribute to Transformers' progressive performance. In this study, we extended the scope of the… ▽ More Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only composed of the multi-head attention; other components can also contribute to Transformers' progressive performance. In this study, we extended the scope of the analysis of Transformers from solely the attention patterns to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect the performance. The codes of our experiments are publicly available. △ Less

Submitted 15 September, 2021; originally announced September 2021.

Comments: 22 pages, accepted to EMNLP 2021 main conference

arXiv:2109.07080 [pdf, other]

Transformer-based Lexically Constrained Headline Generation

Authors: Kosuke Yamada, Yuta Hitomi, Hideaki Tamori, Ryohei Sasano, Naoaki Okazaki, Kentaro Inui, Koichi Takeda

Abstract: This paper explores a variant of automatic headline generation methods, where a generated headline is required to include a given phrase such as a company or a product name. Previous methods using Transformer-based models generate a headline including a given phrase by providing the encoder with additional information corresponding to the given phrase. However, these methods cannot always include… ▽ More This paper explores a variant of automatic headline generation methods, where a generated headline is required to include a given phrase such as a company or a product name. Previous methods using Transformer-based models generate a headline including a given phrase by providing the encoder with additional information corresponding to the given phrase. However, these methods cannot always include the phrase in the generated headline. Inspired by previous RNN-based methods generating token sequences in backward and forward directions from the given phrase, we propose a simple Transformer-based method that guarantees to include the given phrase in the high-quality generated headline. We also consider a new headline generation strategy that takes advantage of the controllable generation order of Transformer. Our experiments with the Japanese News Corpus demonstrate that our methods, which are guaranteed to include the phrase in the generated headline, achieve ROUGE scores comparable to previous Transformer-based methods. We also show that our generation strategy performs better than previous strategies. △ Less

Submitted 15 September, 2021; originally announced September 2021.

Comments: EMNLP 2021

arXiv:2109.06853 [pdf, other]

Summarize-then-Answer: Generating Concise Explanations for Multi-hop Reading Comprehension

Authors: Naoya Inoue, Harsh Trivedi, Steven Sinha, Niranjan Balasubramanian, Kentaro Inui

Abstract: How can we generate concise explanations for multi-hop Reading Comprehension (RC)? The current strategies of identifying supporting sentences can be seen as an extractive question-focused summarization of the input text. However, these extractive explanations are not necessarily concise i.e. not minimally sufficient for answering a question. Instead, we advocate for an abstractive approach, where… ▽ More How can we generate concise explanations for multi-hop Reading Comprehension (RC)? The current strategies of identifying supporting sentences can be seen as an extractive question-focused summarization of the input text. However, these extractive explanations are not necessarily concise i.e. not minimally sufficient for answering a question. Instead, we advocate for an abstractive approach, where we propose to generate a question-focused, abstractive summary of input paragraphs and then feed it to an RC system. Given a limited amount of human-annotated abstractive explanations, we train the abstractive explainer in a semi-supervised manner, where we start from the supervised model and then train it further through trial and error maximizing a conciseness-promoted reward function. Our experiments demonstrate that the proposed abstractive explainer can generate more compact explanations than an extractive explainer with limited supervision (only 2k instances) while maintaining sufficiency. △ Less

Submitted 14 September, 2021; originally announced September 2021.

Comments: Accepted to EMNLP2021 Long Paper (Main Track)

arXiv:2109.05644 [pdf, other]

SHAPE: Shifted Absolute Position Embedding for Transformers

Authors: Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, Kentaro Inui

Abstract: Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from a lack of generalization to test data with unseen lengths or high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, which is a key property of recent… ▽ More Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from a lack of generalization to test data with unseen lengths or high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, which is a key property of recent successful position representations, by randomly shifting absolute positions during training. We demonstrate that SHAPE is empirically comparable to its counterpart while being simpler and faster. △ Less

Submitted 12 September, 2021; originally announced September 2021.

Comments: EMNLP 2021 (short paper, main conference)

arXiv:2108.08631 [pdf, other]

doi 10.1103/PhysRevResearch.3.043126

Determinant-free fermionic wave function using feed-forward neural networks

Authors: Koji Inui, Yasuyuki Kato, Yukitoshi Motome

Abstract: We propose a general framework for finding the ground state of many-body fermionic systems by using feed-forward neural networks. The anticommutation relation for fermions is usually implemented to a variational wave function by the Slater determinant (or Pfaffian), which is a computational bottleneck because of the numerical cost of $O(N^3)$ for $N$ particles. We bypass this bottleneck by explici… ▽ More We propose a general framework for finding the ground state of many-body fermionic systems by using feed-forward neural networks. The anticommutation relation for fermions is usually implemented to a variational wave function by the Slater determinant (or Pfaffian), which is a computational bottleneck because of the numerical cost of $O(N^3)$ for $N$ particles. We bypass this bottleneck by explicitly calculating the sign changes associated with particle exchanges in real space and using fully connected neural networks for optimizing the rest parts of the wave function. This reduces the computational cost to $O(N^2)$ or less. We show that the accuracy of the approximation can be improved by optimizing the "variance" of the energy simultaneously with the energy itself. We also find that a reweighting method in Monte Carlo sampling can stabilize the calculation. These improvements can be applied to other approaches based on variational Monte Carlo methods. Moreover, we show that the accuracy can be further improved by using the symmetry of the system, the representative states, and an additional neural network implementing a generalized Gutzwiller-Jastrow factor. We demonstrate the efficiency of the method by applying it to a two-dimensional Hubbard model. △ Less

Submitted 22 August, 2021; v1 submitted 19 August, 2021; originally announced August 2021.

Journal ref: Phys. Rev. Research 3, 043126 (2021)

arXiv:2106.01229 [pdf, other]

Lower Perplexity is Not Always Human-Like

Authors: Tatsuki Kuribayashi, Yohei Oseki, Takumi Ito, Ryo Yoshida, Masayuki Asahara, Kentaro Inui

Abstract: In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universal within the general community. In order to fill the gap, this paper investigates whether the estab… ▽ More In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universal within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization -- the lower perplexity a language model has, the more human-like the language model is -- in Japanese with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models. △ Less

Submitted 1 November, 2022; v1 submitted 2 June, 2021; originally announced June 2021.

Comments: Accepted by ACL 2021

arXiv:2106.01077 [pdf, other]

SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics

Authors: Hitomi Yanaka, Koji Mineshima, Kentaro Inui

Abstract: Recently, deep neural networks (DNNs) have achieved great success in semantically challenging NLP tasks, yet it remains unclear whether DNN models can capture compositional meanings, those aspects of meaning that have been long studied in formal semantics. To investigate this issue, we propose a Systematic Generalization testbed based on Natural language Semantics (SyGNS), whose challenge is to ma… ▽ More Recently, deep neural networks (DNNs) have achieved great success in semantically challenging NLP tasks, yet it remains unclear whether DNN models can capture compositional meanings, those aspects of meaning that have been long studied in formal semantics. To investigate this issue, we propose a Systematic Generalization testbed based on Natural language Semantics (SyGNS), whose challenge is to map natural language sentences to multiple forms of scoped meaning representations, designed to account for various semantic phenomena. Using SyGNS, we test whether neural networks can systematically parse sentences involving novel combinations of logical expressions such as quantifiers and negation. Experiments show that Transformer and GRU models can generalize to unseen combinations of quantifiers, negations, and modifiers that are similar to given training instances in form, but not to the others. We also find that the generalization performance to unseen combinations is better when the form of meaning representations is simpler. The data and code for SyGNS are publicly available at https://github.com/verypluming/SyGNS. △ Less

Submitted 2 June, 2021; originally announced June 2021.

Comments: Findings (long paper) of ACL-IJCNLP2021

arXiv:2104.11514 [pdf, other]

Learning to Learn to be Right for the Right Reasons

Authors: Pride Kavumba, Benjamin Heinzerling, Ana Brassard, Kentaro Inui

Abstract: Improving model generalization on held-out data is one of the core objectives in commonsense reasoning. Recent work has shown that models trained on the dataset with superficial cues tend to perform well on the easy test set with superficial cues but perform poorly on the hard test set without superficial cues. Previous approaches have resorted to manual methods of encouraging models not to overfi… ▽ More Improving model generalization on held-out data is one of the core objectives in commonsense reasoning. Recent work has shown that models trained on the dataset with superficial cues tend to perform well on the easy test set with superficial cues but perform poorly on the hard test set without superficial cues. Previous approaches have resorted to manual methods of encouraging models not to overfit to superficial cues. While some of the methods have improved performance on hard instances, they also lead to degraded performance on easy instances. Here, we propose to explicitly learn a model that does well on both the easy test set with superficial cues and hard test set without superficial cues. Using a meta-learning objective, we learn such a model that improves performance on both the easy test set and the hard test set. By evaluating our models on Choice of Plausible Alternatives (COPA) and Commonsense Explanation, we show that our proposed method leads to improved performance on both the easy test set and the hard test set upon which we observe up to 16.5 percentage points improvement over the baseline. △ Less

Submitted 23 April, 2021; originally announced April 2021.

arXiv:2104.07924 [pdf, other]

A Comparative Study on Collecting High-Quality Implicit Reasonings at a Large-scale

Authors: Keshav Singh, Paul Reisert, Naoya Inoue, Kentaro Inui

Abstract: Explicating implicit reasoning (i.e. warrants) in arguments is a long-standing challenge for natural language understanding systems. While recent approaches have focused on explicating warrants via crowdsourcing or expert annotations, the quality of warrants has been questionable due to the extreme complexity and subjectivity of the task. In this paper, we tackle the complex task of warrant explic… ▽ More Explicating implicit reasoning (i.e. warrants) in arguments is a long-standing challenge for natural language understanding systems. While recent approaches have focused on explicating warrants via crowdsourcing or expert annotations, the quality of warrants has been questionable due to the extreme complexity and subjectivity of the task. In this paper, we tackle the complex task of warrant explication and devise various methodologies for collecting warrants. We conduct an extensive study with trained experts to evaluate the resulting warrants of each methodology and find that our methodologies allow for high-quality warrants to be collected. We construct a preliminary dataset of 6,000 warrants annotated over 600 arguments for 3 debatable topics. To facilitate research in related downstream tasks, we release our guidelines and preliminary dataset. △ Less

Submitted 16 April, 2021; originally announced April 2021.

Comments: 2 figures, 3 tables

arXiv:2104.07425 [pdf, other]

Pseudo Zero Pronoun Resolution Improves Zero Anaphora Resolution

Authors: Ryuto Konno, Shun Kiyono, Yuichiroh Matsubayashi, Hiroki Ouchi, Kentaro Inui

Abstract: Masked language models (MLMs) have contributed to drastic performance improvements with regard to zero anaphora resolution (ZAR). To further improve this approach, in this study, we made two proposals. The first is a new pretraining task that trains MLMs on anaphoric relations with explicit supervision, and the second proposal is a new finetuning method that remedies a notorious issue, the pretrai… ▽ More Masked language models (MLMs) have contributed to drastic performance improvements with regard to zero anaphora resolution (ZAR). To further improve this approach, in this study, we made two proposals. The first is a new pretraining task that trains MLMs on anaphoric relations with explicit supervision, and the second proposal is a new finetuning method that remedies a notorious issue, the pretrain-finetune discrepancy. Our experiments on Japanese ZAR demonstrated that our two proposals boost the state-of-the-art performance, and our detailed analysis provides new insights on the remaining challenges. △ Less

Submitted 10 September, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

Comments: Long paper accepted by EMNLP2021 main conference

arXiv:2102.06540 [pdf, other]

Two Training Strategies for Improving Relation Extraction over Universal Graph

Authors: Qin Dai, Naoya Inoue, Ryo Takahashi, Kentaro Inui

Abstract: This paper explores how the Distantly Supervised Relation Extraction (DS-RE) can benefit from the use of a Universal Graph (UG), the combination of a Knowledge Graph (KG) and a large-scale text collection. A straightforward extension of a current state-of-the-art neural model for DS-RE with a UG may lead to degradation in performance. We first report that this degradation is associated with the di… ▽ More This paper explores how the Distantly Supervised Relation Extraction (DS-RE) can benefit from the use of a Universal Graph (UG), the combination of a Knowledge Graph (KG) and a large-scale text collection. A straightforward extension of a current state-of-the-art neural model for DS-RE with a UG may lead to degradation in performance. We first report that this degradation is associated with the difficulty in learning a UG and then propose two training strategies: (1) Path Type Adaptive Pretraining, which sequentially trains the model with different types of UG paths so as to prevent the reliance on a single type of UG path; and (2) Complexity Ranking Guided Attention mechanism, which restricts the attention span according to the complexity of a UG path so as to force the model to extract features not only from simple UG paths but also from complex ones. Experimental results on both biomedical and NYT10 datasets prove the robustness of our methods and achieve a new state-of-the-art result on the NYT10 dataset. The code and datasets used in this paper are available at https://github.com/baodaiqin/UGDSRE. △ Less

Submitted 6 May, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

arXiv:2101.10713 [pdf, other]

Exploring Transitivity in Neural NLI Models through Veridicality

Authors: Hitomi Yanaka, Koji Mineshima, Kentaro Inui

Abstract: Despite the recent success of deep neural networks in natural language processing, the extent to which they can demonstrate human-like generalization capacities for natural language understanding remains unclear. We explore this issue in the domain of natural language inference (NLI), focusing on the transitivity of inference relations, a fundamental property for systematically drawing inferences.… ▽ More Despite the recent success of deep neural networks in natural language processing, the extent to which they can demonstrate human-like generalization capacities for natural language understanding remains unclear. We explore this issue in the domain of natural language inference (NLI), focusing on the transitivity of inference relations, a fundamental property for systematically drawing inferences. A model capturing transitivity can compose basic inference patterns and draw new inferences. We introduce an analysis method using synthetic and naturalistic NLI datasets involving clause-embedding verbs to evaluate whether models can perform transitivity inferences composed of veridical inferences and arbitrary inference types. We find that current NLI models do not perform consistently well on transitivity inference tasks, suggesting that they lack the generalization capacity for drawing composite inferences from provided training examples. The data and code for our analysis are publicly available at https://github.com/verypluming/transitivity. △ Less

Submitted 26 January, 2021; originally announced January 2021.

Comments: accepted by EACL2021 as a long paper

arXiv:2012.04207 [pdf, other]

Efficient Estimation of Influence of a Training Instance

Authors: Sosuke Kobayashi, Sho Yokoi, Jun Suzuki, Kentaro Inui

Abstract: Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model's prediction would be changed if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-mask… ▽ More Understanding the influence of a training instance on a neural network model leads to improving interpretability. However, it is difficult and inefficient to evaluate the influence, which shows how a model's prediction would be changed if a training instance were not used. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents the sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that learned or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset for improving generalization. △ Less

Submitted 19 November, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

Comments: This is an extended version of the paper presented at SustaiNLP 2020

arXiv:2011.02121 [pdf, other]

PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents

Authors: Ryo Fujii, Masato Mita, Kaori Abe, Kazuaki Hanawa, Makoto Morishita, Jun Suzuki, Kentaro Inui

Abstract: Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising dire… ▽ More Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising directions is to develop a model that correctly handles these expressions. Though its importance has been recognized, it is still not clear as to what creates the great gap in performance between the translation of clean input and that of UGC. To answer the question, we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation. Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena. △ Less

Submitted 3 November, 2020; originally announced November 2020.

Comments: 15 pages, 4 figures, accepted at COLING 2020

arXiv:2011.01785 [pdf, other]

Modeling Event Salience in Narratives via Barthes' Cardinal Functions

Authors: Takaki Otake, Sho Yokoi, Naoya Inoue, Ryo Takahashi, Tatsuki Kuribayashi, Kentaro Inui

Abstract: Events in a narrative differ in salience: some are more important to the story than others. Estimating event salience is useful for tasks such as story generation, and as a tool for text analysis in narratology and folkloristics. To compute event salience without any annotations, we adopt Barthes' definition of event salience and propose several unsupervised methods that require only a pre-trained… ▽ More Events in a narrative differ in salience: some are more important to the story than others. Estimating event salience is useful for tasks such as story generation, and as a tool for text analysis in narratology and folkloristics. To compute event salience without any annotations, we adopt Barthes' definition of event salience and propose several unsupervised methods that require only a pre-trained language model. Evaluating the proposed methods on folktales with event salience annotation, we show that the proposed methods outperform baseline methods and find fine-tuning a language model on narrative texts is a key factor in improving the proposed methods. △ Less

Submitted 3 November, 2020; originally announced November 2020.

Comments: accepted to COLING 2020

arXiv:2011.00948 [pdf, other]

An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution

Authors: Ryuto Konno, Yuichiroh Matsubayashi, Shun Kiyono, Hiroki Ouchi, Ryo Takahashi, Kentaro Inui

Abstract: One critical issue of zero anaphora resolution (ZAR) is the scarcity of labeled data. This study explores how effectively this problem can be alleviated by data augmentation. We adopt a state-of-the-art data augmentation method, called the contextual data augmentation (CDA), that generates labeled training instances using a pretrained language model. The CDA has been reported to work well for seve… ▽ More One critical issue of zero anaphora resolution (ZAR) is the scarcity of labeled data. This study explores how effectively this problem can be alleviated by data augmentation. We adopt a state-of-the-art data augmentation method, called the contextual data augmentation (CDA), that generates labeled training instances using a pretrained language model. The CDA has been reported to work well for several other natural language processing tasks, including text classification and machine translation. This study addresses two underexplored issues on CDA, that is, how to reduce the computational cost of data augmentation and how to ensure the quality of the generated data. We also propose two methods to adapt CDA to ZAR: [MASK]-based augmentation and linguistically-controlled masking. Consequently, the experimental results on Japanese ZAR show that our methods contribute to both the accuracy gain and the computation cost reduction. Our closer analysis reveals that the proposed method can improve the quality of the augmented training data when compared to the conventional CDA. △ Less

Submitted 4 November, 2020; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: 13 pages, accepted by COLING 2020

arXiv:2010.06137 [pdf, other]

Corruption Is Not All Bad: Incorporating Discourse Structure into Pre-training via Corruption for Essay Scoring

Authors: Farjana Sultana Mim, Naoya Inoue, Paul Reisert, Hiroki Ouchi, Kentaro Inui

Abstract: Existing approaches for automated essay scoring and document representation learning typically rely on discourse parsers to incorporate discourse structure into text representation. However, the performance of parsers is not always adequate, especially when they are used on noisy texts, such as student essays. In this paper, we propose an unsupervised pre-training approach to capture discourse str… ▽ More Existing approaches for automated essay scoring and document representation learning typically rely on discourse parsers to incorporate discourse structure into text representation. However, the performance of parsers is not always adequate, especially when they are used on noisy texts, such as student essays. In this paper, we propose an unsupervised pre-training approach to capture discourse structure of essays in terms of coherence and cohesion that does not require any discourse parser or annotation. We introduce several types of token, sentence and paragraph-level corruption techniques for our proposed pre-training approach and augment masked language modeling pre-training with our pre-training method to leverage both contextualized and discourse information. Our proposed unsupervised approach achieves new state-of-the-art result on essay Organization scoring task. △ Less

Submitted 12 October, 2020; originally announced October 2020.

arXiv:2010.04332 [pdf, other]

Langsmith: An Interactive Academic Text Revision System

Authors: Takumi Ito, Tatsuki Kuribayashi, Masatoshi Hidaka, Jun Suzuki, Kentaro Inui

Abstract: Despite the current diversity and inclusion initiatives in the academic community, researchers with a non-native command of English still face significant obstacles when writing papers in English. This paper presents the Langsmith editor, which assists inexperienced, non-native researchers to write English papers, especially in the natural language processing (NLP) field. Our system can suggest fl… ▽ More Despite the current diversity and inclusion initiatives in the academic community, researchers with a non-native command of English still face significant obstacles when writing papers in English. This paper presents the Langsmith editor, which assists inexperienced, non-native researchers to write English papers, especially in the natural language processing (NLP) field. Our system can suggest fluent, academic-style sentences to writers based on their rough, incomplete phrases or sentences. The system also encourages interaction between human writers and the computerized revision system. The experimental results demonstrated that Langsmith helps non-native English-speaker students write papers in English. The system is available at https://emnlp-demo.editor. langsmith.co.jp/. △ Less

Submitted 8 October, 2020; originally announced October 2020.

Comments: Accepted at EMNLP 2020 (system demonstrations)

Showing 1–50 of 94 results for author: Inui, K