-
Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping
Authors:
Wenhao Zhu,
Sizhe Liu,
Shujian Huang,
Shuaijie She,
Chris Wendler,
Jiajun Chen
Abstract:
Decoding by contrasting layers (DoLa) is designed to improve the generation quality of large language models (LLMs) by contrasting the prediction probabilities between an early exit output (amateur logits) and the final output (expert logits). However, we find that this approach does not work well on non-English tasks. Inspired by previous interpretability work on language transition during the model's forward pass, we discover that this issue arises from a language mismatch between early exit output and final output. In this work, we propose an improved contrastive decoding algorithm that is effective for diverse languages beyond English. To obtain more helpful amateur logits, we devise two strategies to skip a set of bottom, language-agnostic layers based on our preliminary analysis. Experimental results on multilingual reasoning benchmarks demonstrate that our proposed method outperforms previous contrastive decoding baselines and substantially improves LLMs' chain-of-thought reasoning accuracy across 11 languages. The project will be available at: https://github.com/NJUNLP/SkipLayerCD.
Submitted 15 July, 2024;
originally announced July 2024.
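The expert/amateur contrast described in the abstract above can be illustrated with a minimal sketch. It shows only the generic idea of scoring tokens by the gap between expert and amateur log-probabilities; it does not implement the paper's layer-skipping strategies. The function names, the toy logits, and the plausibility cutoff `alpha` (a filter commonly used in contrastive decoding) are assumptions for illustration, not details taken from the abstract.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Score each token by expert log-prob minus amateur log-prob.

    Tokens whose expert probability falls below alpha times the top
    expert probability are masked out (hypothetical cutoff, assumed here).
    """
    expert_logp = log_softmax(expert_logits)
    amateur_logp = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(expert_logp)
    return [
        e - a if e >= cutoff else -math.inf
        for e, a in zip(expert_logp, amateur_logp)
    ]


# Toy vocabulary of four tokens (made-up numbers).
expert = [2.0, 1.5, 0.1, -3.0]   # final-layer logits
amateur = [2.0, 0.2, 0.1, -3.0]  # early-exit logits
scores = contrastive_scores(expert, amateur)
next_token = scores.index(max(scores))
print(next_token)
```

In this toy case the contrast favors the token whose probability rises most between the amateur and the expert, rather than the token with the highest expert probability.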
-
Why Not Transform Chat Large Language Models to Non-English?
Authors:
Xiang Geng,
Ming Zhu,
Jiahuan Li,
Zhejian Lai,
Wei Zou,
Shuaijie She,
Jiaxin Guo,
Xiaofeng Zhao,
Yinglu Li,
Yuang Li,
Chang Su,
Yanqing Zhao,
Xinglin Lyu,
Min Zhang,
Jiajun Chen,
Hao Yang,
Shujian Huang
Abstract:
The scarcity of non-English data limits the development of non-English large language models (LLMs). Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method. Previous works start from base LLMs and perform knowledge distillation (KD) with data generated by stronger LLMs, e.g. GPT-4. Compared to base LLMs, chat LLMs are further optimized for advanced abilities, e.g. multi-turn conversation and human preference alignment, and are thus more powerful in both helpfulness and safety. However, transforming a chat LLM involves two critical issues: (1) How can we effectively transfer advanced abilities without their supervised data? (2) How can we prevent catastrophic forgetting of the original knowledge during transformation? We target these issues by introducing a simple framework called TransLLM. For the first issue, TransLLM divides the transfer problem into some common sub-tasks with the translation chain-of-thought, which uses translation as the bridge between English and non-English step by step. We further enhance the performance of sub-tasks with publicly available data. For the second issue, we propose a method comprising two synergistic components: low-rank adaptation for training to maintain the original LLM parameters, and recovery KD, which utilizes data generated by the chat LLM itself to recover the original knowledge from the frozen parameters. In the experiments, we transform LLaMA-2-chat-7B to the Thai language. Our method, using only single-turn data, outperforms strong baselines and ChatGPT on the multi-turn benchmark MT-bench. Furthermore, our method, without safety data, rejects more harmful queries from the safety benchmark AdvBench than both ChatGPT and GPT-4.
Submitted 31 May, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Question Translation Training for Better Multilingual Reasoning
Authors:
Wenhao Zhu,
Shujian Huang,
Fei Yuan,
Shuaijie She,
Jiajun Chen,
Alexandra Birch
Abstract:
Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions. A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training. This approach not only incurs high cost, but also results in poorly translated data due to the non-standard formatting of mathematical chain-of-thought. In this paper, we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data. In this way we perform targeted, in-domain language alignment which makes best use of English instruction data to unlock the LLMs' multilingual reasoning abilities. Experimental results on LLaMA2-13B show that question alignment leads to consistent improvements over the translate-training approach: an average improvement of 11.3% and 16.1% accuracy across ten languages on the MGSM and MSVAMP multilingual reasoning benchmarks. The project will be available at: https://github.com/NJUNLP/QAlign.
Submitted 29 June, 2024; v1 submitted 15 January, 2024;
originally announced January 2024.
-
MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization
Authors:
Shuaijie She,
Wei Zou,
Shujian Huang,
Wenhao Zhu,
Xiang Liu,
Xiang Geng,
Jiajun Chen
Abstract:
Though reasoning abilities are considered language-agnostic, existing LLMs exhibit inconsistent reasoning abilities across different languages, e.g., reasoning in a dominant language such as English is superior to reasoning in other languages due to the imbalance of multilingual training data. To enhance reasoning abilities in non-dominant languages, we propose a Multilingual-Alignment-as-Preference Optimization framework (MAPO), aiming to align the reasoning processes in other languages with the dominant language. Specifically, we harness an off-the-shelf translation model to estimate the consistency between answers in non-dominant and dominant languages, which we adopt as the preference for optimization, e.g., Direct Preference Optimization (DPO) or Proximal Policy Optimization (PPO). Experiments show that MAPO stably achieves significant improvements in the multilingual reasoning of various models on all three benchmarks (MSVAMP +16.2%, MGSM +6.1%, and MNumGLUESub +13.3%), with improved reasoning consistency across languages.
Submitted 13 April, 2024; v1 submitted 12 January, 2024;
originally announced January 2024.
-
Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models
Authors:
Shuaijie She,
Shujian Huang,
Xingyun Wang,
Yanke Zhou,
Jiajun Chen
Abstract:
LLMs (Large Language Models) usually interact with users in the form of dialogue and generate responses following their instructions, which naturally requires dialogue comprehension abilities. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to perform the evaluation focusing on the factual consistency issue with the help of the dialogue summarization task. Besides evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-QA). Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries. For answering the factual questions, which is more challenging, the average error rate of all evaluated LLMs is 36.1%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject/object of the conversation is still challenging for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data, which achieved a relative error rate reduction of 11% on DIAC-QA.
Submitted 1 April, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning
Authors:
Jingyuan Selena She,
Christopher Potts,
Samuel R. Bowman,
Atticus Geiger
Abstract:
A number of recent benchmarks seek to assess how well models handle natural language negation. However, these benchmarks lack the controlled example paradigms that would allow us to infer whether a model had learned how negation morphemes semantically scope. To fill these analytical gaps, we present the Scoped Negation NLI (ScoNe-NLI) benchmark, which contains contrast sets of six examples with up to two negations where either zero, one, or both negative morphemes affect the NLI label. We use ScoNe-NLI to assess fine-tuning and in-context learning strategies. We find that RoBERTa and DeBERTa models solve ScoNe-NLI after many-shot fine-tuning. For in-context learning, we test InstructGPT models and find that most prompt strategies are not successful, including those using step-by-step reasoning. To better understand this result, we extend ScoNe with ScoNe-NLG, a sentence completion test set that embeds negation reasoning in short narratives. Here, InstructGPT is successful, which reveals that the model can correctly reason about negation, but struggles to do so on prompt-adapted NLI examples outside of its core pretraining regime.
Submitted 30 May, 2023;
originally announced May 2023.
-
Evaporation characteristics of Er$^{3+}$ doped silica fiber and its application in the preparation of whispering gallery mode lasers
Authors:
Angzhen Li,
Jonathan M. Ward,
Ke Tian,
Jibo Yu,
Shengfei She,
Chaoqi Hou,
Haitao Guo,
Síle Nic Chormaic,
Pengfei Wang
Abstract:
The fabrication of whispering gallery lasers (WGL) is used to experimentally evaluate the evaporation rate (mol/$μ$m) and ratio (mol/mol) of erbium and silica lost from a doped fiber during heating. Fixed lengths of doped silica fiber are spliced to different lengths of undoped fiber and then evaporated by feeding into the focus of a CO$_{2}$ laser. During evaporation, erbium ions are precipitated in the doped silica fiber to control the erbium concentration in the remaining SiO$_2$, which is melted into a microsphere. By increasing the length of the undoped section, a critical point is reached where effectively no ions remain in the glass microsphere. The critical point is found using the lasing spectra of the whispering gallery modes in microspheres with equal sizes. From the critical point, it is estimated that, for a given CO$_{2}$ laser power, $6.36 \times 10^{-21}$~mol of Er$^{3+}$ is lost during the evaporation process for every cubic micron of silica fiber. This is equivalent to $1.74 \times 10^{-7}$~mol of Er$^{3+}$ lost per mol of SiO$_{2}$ evaporated. This result facilitates the control of the doping concentration in WGLs and provides insight into the kinetics of laser-induced evaporation of doped silica.
Submitted 12 May, 2023;
originally announced May 2023.
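The two loss figures quoted in the abstract above (per cubic micron of fiber and per mol of SiO$_2$) can be cross-checked with a short calculation. This is only a consistency sketch: the fused-silica density of 2.2 g/cm$^3$ and the SiO$_2$ molar mass of 60.08 g/mol are standard reference values assumed here, not numbers given in the abstract.

```python
# Assumed reference values (not stated in the abstract).
SILICA_DENSITY_G_PER_CM3 = 2.2      # typical fused-silica density
SIO2_MOLAR_MASS_G_PER_MOL = 60.08   # Si + 2 O
UM3_PER_CM3 = 1e12                  # (1e4 um per cm) cubed

# Value quoted in the abstract: Er3+ lost per cubic micron of fiber.
er_lost_mol_per_um3 = 6.36e-21

# Moles of SiO2 contained in one cubic micron of silica.
sio2_mol_per_um3 = (
    SILICA_DENSITY_G_PER_CM3 / SIO2_MOLAR_MASS_G_PER_MOL / UM3_PER_CM3
)

# Er3+ lost per mol of SiO2 evaporated.
er_per_sio2 = er_lost_mol_per_um3 / sio2_mol_per_um3
print(f"{er_per_sio2:.3g}")  # prints 1.74e-07
```

Under these assumed material constants, the result agrees with the $1.74 \times 10^{-7}$ mol/mol figure in the abstract.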
-
CoP: Factual Inconsistency Detection by Controlling the Preference
Authors:
Shuaijie She,
Xiang Geng,
Shujian Huang,
Jiajun Chen
Abstract:
Abstractive summarization is the process of generating a summary given a document as input. Although significant progress has been made, the factual inconsistency between the document and the generated summary still limits its practical applications. Previous work found that the probabilities assigned by the generation model reflect its preferences for the generated summary, including the preference for factual consistency and the preference for the language or knowledge prior. To separate the preference for factual consistency, we propose an unsupervised framework named CoP that controls the preference of the generation model with the help of a prompt. More specifically, the framework performs an extra inference step in which a text prompt is introduced as an additional input. In this way, another preference is described by the generation probability of this extra inference process. The difference between the above two preferences, i.e. the difference between the probabilities, can be used as a measurement for detecting factual inconsistencies. Interestingly, we found that with a properly designed prompt, our framework can evaluate specific preferences and serve as a measurement for fine-grained categories of inconsistency, such as entity-related inconsistency, coreference-related inconsistency, etc. Moreover, our framework can also be extended to the supervised setting to learn a better prompt from labeled data. Experiments show that our framework achieves new SOTA results on three factual inconsistency detection tasks.
Submitted 30 March, 2023; v1 submitted 3 December, 2022;
originally announced December 2022.
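The probability-difference idea in the abstract above can be sketched as a comparison of per-token log-probabilities from two inference passes, one without and one with the extra text prompt. The function name, the sign convention, the mean aggregation, and the toy numbers below are all assumptions for illustration; the abstract does not specify how CoP combines the two passes.

```python
def preference_gap(logp_plain, logp_prompted):
    """Per-token and mean log-probability change caused by the prompt.

    logp_plain:    summary token log-probs given the document only
    logp_prompted: the same tokens' log-probs with the extra prompt added
    (Sign convention and averaging are hypothetical choices.)
    """
    if len(logp_plain) != len(logp_prompted):
        raise ValueError("both passes must score the same summary tokens")
    per_token = [q - p for p, q in zip(logp_plain, logp_prompted)]
    return per_token, sum(per_token) / len(per_token)


# Toy scores for a three-token summary (made-up numbers).
plain = [-0.5, -2.0, -0.25]
prompted = [-0.5, -0.5, -0.25]
per_token, mean_gap = preference_gap(plain, prompted)
print(per_token, mean_gap)
```

In this toy case only the second token's probability shifts when the prompt is added, so a per-token view of the gap points at the token that the prompt influences most.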
-
Event knowledge in large language models: the gap between the impossible and the unlikely
Authors:
Carina Kauf,
Anna A. Ivanova,
Giulia Rambelli,
Emmanuele Chersoni,
Jingyuan Selena She,
Zawad Chowdhury,
Evelina Fedorenko,
Alessandro Lenci
Abstract:
Word co-occurrence patterns in language corpora contain a surprising amount of conceptual knowledge. Large language models (LLMs), trained to predict words in context, leverage these patterns to achieve impressive performance on diverse semantic tasks requiring world knowledge. An important but understudied question about LLMs' semantic abilities is whether they acquire generalized knowledge of common events. Here, we test whether five pre-trained LLMs (from 2018's BERT to 2023's MPT) assign higher likelihood to plausible descriptions of agent-patient interactions than to minimally different implausible versions of the same event. Using three curated sets of minimal sentence pairs (total n=1,215), we found that pre-trained LLMs possess substantial event knowledge, outperforming other distributional language models. In particular, they almost always assign higher likelihood to possible vs. impossible events (The teacher bought the laptop vs. The laptop bought the teacher). However, LLMs show less consistent preferences for likely vs. unlikely events (The nanny tutored the boy vs. The boy tutored the nanny). In follow-up analyses, we show that (i) LLM scores are driven by both plausibility and surface-level sentence features, (ii) LLM scores generalize well across syntactic variants (active vs. passive constructions) but less well across semantic variants (synonymous sentences), (iii) some LLM errors mirror human judgment ambiguity, and (iv) sentence plausibility serves as an organizing dimension in internal LLM representations. Overall, our results show that important aspects of event knowledge naturally emerge from distributional linguistic patterns, but also highlight a gap between representations of possible/impossible and likely/unlikely events.
Submitted 26 October, 2023; v1 submitted 2 December, 2022;
originally announced December 2022.
-
Formal Semantics of the CDL Language
Authors:
Thorsten Berger,
Steven She
Abstract:
We reverse-engineer a formal semantics of the Component Definition Language (CDL), which is part of the highly configurable, embedded operating system eCos. This work provides the basis for an analysis and comparison of the two variability-modeling languages Kconfig and CDL. The semantics given in this document are based on analyzing the CDL documentation, inspecting the source code of the toolchain, as well as testing the tools on particular examples.
Submitted 23 September, 2022;
originally announced September 2022.
-
Formal Semantics of the Kconfig Language
Authors:
Steven She,
Thorsten Berger
Abstract:
The Kconfig language defines a set of symbols that are assigned a value in a configuration. We describe the semantics of the Kconfig language according to the behavior exhibited in the xconfig configurator. We assume an abstract syntax representation for concepts in the Kconfig language and delegate the details of the translation from concrete to abstract syntaxes to a later document.
Submitted 11 September, 2022;
originally announced September 2022.
-
Autonomous Mobile Robot Navigation in Uneven and Unstructured Indoor Environments
Authors:
Chaoqun Wang,
Lili Meng,
Sizhen She,
Ian M. Mitchell,
Teng Li,
Frederick Tung,
Weiwei Wan,
Max. Q. -H. Meng,
Clarence W. de Silva
Abstract:
Robots are increasingly operating in indoor environments designed for and shared with people. However, robots working safely and autonomously in uneven and unstructured environments still face great challenges. Many modern indoor environments are designed with wheelchair accessibility in mind. This presents an opportunity for wheeled robots to navigate through sloped areas while avoiding staircases. In this paper, we present an integrated software and hardware system for autonomous mobile robot navigation in uneven and unstructured indoor environments. This modular and reusable software framework incorporates capabilities of perception and navigation. Our robot first builds a 3D OctoMap representation of the uneven environment, with 3D mapping that uses wheel odometry, 2D laser and RGB-D data. Then we project multilayer 2D occupancy maps from the OctoMap to generate the traversable map based on layer differences. The safe traversable map serves as the input for efficient autonomous navigation. Furthermore, we employ a variable step size Rapidly Exploring Random Tree that adjusts the step size automatically, eliminating the need to tune step sizes according to environments. We conduct extensive experiments in simulation and in the real world, demonstrating the efficacy and efficiency of our system.
Submitted 28 October, 2017;
originally announced October 2017.
-
Far Dissipation Range of Turbulence
Authors:
Shiyi Chen,
Gary Doolen,
Jackson R. Herring,
Robert H. Kraichnan,
Steven A. Orszag,
Zhen Su She
Abstract:
The very small scales of isotropic, Navier-Stokes turbulence at Reynolds number ${\cal R}_λ\approx 15$ are studied by high-resolution direct numerical simulation (DNS) and by integration of the direct-interaction (DIA) equations. The DNS follows the tail of the energy spectrum over more than thirty decades of magnitude. The energy spectrum in the far-dissipation range $5k_d < k < 10k_d$ is well-fitted by $k^α\exp(-ck/k_d)$, where $k_d$ is the Kolmogorov dissipation wavenumber, $α\approx 3.3$ and $c\approx 7.1$. For values of $m$ that emphasize the far-dissipation range, the fields $(-\nabla^2)^m{\bf u}$ exhibit strong spatial intermittency, associated with gentle spatial variations of the lower-$k$ part of the velocity field. DIA analysis gives a prefactor $k^3$ and an exponential decay more rapid than DNS. Averaging over an ensemble of DIA solutions, suggested by the observed intermittency, removes some of the discrepancy.
Submitted 4 March, 1993;
originally announced March 1993.
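The fitted far-dissipation form $k^α\exp(-ck/k_d)$ from the abstract above can be evaluated directly to see how steeply the spectrum falls across the fitted range. The sketch below works in units of $k_d$ and omits the overall prefactor, which the abstract does not give; the function name is an assumption, and the default parameters are the fitted values quoted in the abstract.

```python
import math


def far_dissipation_shape(k_over_kd, alpha=3.3, c=7.1):
    """Unnormalized spectrum shape (k/k_d)**alpha * exp(-c * k/k_d)."""
    return k_over_kd ** alpha * math.exp(-c * k_over_kd)


# Drop in the fitted form between the ends of the range 5 k_d .. 10 k_d.
ratio = far_dissipation_shape(10.0) / far_dissipation_shape(5.0)
decades = -math.log10(ratio)
print(f"fitted form falls by about {decades:.1f} decades")
```

Because only the ratio of two values is used, the unknown prefactor cancels and the result depends solely on the fitted $α$ and $c$.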