-
Benchmarking Complex Instruction-Following with Multiple Constraints Composition
Authors:
Bosi Wen,
Pei Ke,
Xiaotao Gu,
Lindong Wu,
Hao Huang,
Jinfeng Zhou,
Wenchuang Li,
Binxin Hu,
Wendy Gao,
Jiaxin Xu,
Yiming Liu,
Jie Tang,
Hongning Wang,
Minlie Huang
Abstract:
Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.
Submitted 11 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
-
Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
Authors:
Zhexin Zhang,
Junxiao Yang,
Pei Ke,
Shiyao Cui,
Chujie Zheng,
Hongning Wang,
Minlie Huang
Abstract:
LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearning the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions without any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on out-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even with the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at https://github.com/thu-coai/SafeUnlearning.
Submitted 3 July, 2024;
originally announced July 2024.
-
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Authors:
Jiale Cheng,
Yida Lu,
Xiaotao Gu,
Pei Ke,
Xiao Liu,
Yuxiao Dong,
Hongning Wang,
Jie Tang,
Minlie Huang
Abstract:
Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
Submitted 24 June, 2024;
originally announced June 2024.
-
Learning Task Decomposition to Assist Humans in Competitive Programming
Authors:
Jiaxin Wen,
Ruiqi Zhong,
Pei Ke,
Zhihong Shao,
Hongning Wang,
Minlie Huang
Abstract:
When using language models (LMs) to solve complex problems, humans might struggle to understand the LM-generated solutions and repair the flawed ones. To assist humans in repairing them, we propose to automatically decompose complex solutions into multiple simpler pieces that correspond to specific subtasks. We introduce a novel objective for learning task decomposition, termed assistive value (AssistV), which measures the feasibility and speed for humans to repair the decomposed solution. We collect a dataset of human repair experiences on different decomposed solutions. Utilizing the collected data as in-context examples, we then learn to critique, refine, and rank decomposed solutions to improve AssistV. We validate our method under competitive programming problems: under 177 hours of human study, our method enables non-experts to solve 33.3% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.
Submitted 17 July, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering
Authors:
Zhihua Wen,
Zhiliang Tian,
Zexin Jian,
Zhen Huang,
Pei Ke,
Yifu Gao,
Minlie Huang,
Dongsheng Li
Abstract:
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations. The knowledge boundary (KB) of an LLM limits its factual understanding, beyond which it may begin to hallucinate. Investigating the perception of LLMs' KB is crucial for detecting hallucinations and LLMs' reliable generation. Current studies perceive LLMs' KB on questions with a concrete answer (close-ended questions) while paying limited attention to semi-open-ended questions (SoeQ) that correspond to many potential answers. Some researchers achieve it by judging whether the question is answerable or not. However, this paradigm is unsuitable for SoeQ, which are usually partially answerable, containing both answerable and ambiguous (unanswerable) answers. Ambiguous answers are essential for knowledge-seeking, but they may go beyond the KB of LLMs. In this paper, we perceive the LLMs' KB with SoeQ by discovering more ambiguous answers. First, we apply an LLM-based approach to construct SoeQ and obtain answers from a target LLM. Unfortunately, the output probabilities of mainstream black-box LLMs are inaccessible to sample for low-probability ambiguous answers. Therefore, we apply an open-sourced auxiliary model to explore ambiguous answers for the target LLM. We calculate the nearest semantic representation for existing answers to estimate their probabilities, with which we reduce the generation probability of high-probability answers to achieve a more effective generation. Finally, we compare the results from the RAG-based evaluation and LLM self-evaluation to categorize four types of ambiguous answers that are beyond the KB of the target LLM. Following our method, we construct a dataset to perceive the KB for GPT-4. We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB. Besides, our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
Submitted 23 May, 2024;
originally announced May 2024.
-
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
Authors:
Zhexin Zhang,
Yida Lu,
Jingyuan Ma,
Di Zhang,
Rui Li,
Pei Ke,
Hao Sun,
Lei Sha,
Zhifang Sui,
Hongning Wang,
Minlie Huang
Abstract:
The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective in real-world situations as a safety evaluator for advanced LLMs. We release ShieldLM at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards, contributing to the ongoing efforts to enhance the safety of LLMs.
Submitted 26 February, 2024;
originally announced February 2024.
-
Towards Efficient Exact Optimization of Language Model Alignment
Authors:
Haozhe Ji,
Cheng Lu,
Yilin Niu,
Pei Ke,
Hongning Wang,
Jun Zhu,
Jie Tang,
Minlie Huang
Abstract:
The alignment of language models with human preferences is vital for their application in real-world tasks. The problem is formulated as optimizing the model's policy to maximize the expected reward that reflects human preferences with minimal deviation from the initial policy. While considered a straightforward solution, reinforcement learning (RL) suffers from high variance in policy updates, which impedes efficient policy improvement. Recently, direct preference optimization (DPO) was proposed to directly optimize the policy from preference data. However, we show that DPO, derived based on the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice. In this paper, we propose efficient exact optimization (EXO) of the alignment objective. EXO is guaranteed to optimize in the same direction as RL algorithms asymptotically for arbitrary policy parametrization. This leads to the same mode-seeking solution, while enabling efficient optimization by circumventing the complexities of RL. We also compare our method to DPO with both theoretical and empirical analyses, and further demonstrate the advantages of our method over existing approaches on realistic human preference data. Code is available at https://github.com/haozheji/exact-optimization.
Submitted 5 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
The 4-adic complexity of quaternary sequences with low autocorrelation and high linear complexity
Authors:
Feifei Yan,
Pinhui Ke,
Lingmei Xiao
Abstract:
Recently, Jiang et al. proposed several new classes of quaternary sequences with low autocorrelation and high linear complexity by using the inverse Gray mapping (JAMC, 69 (2023): 689–706). In this paper, we estimate the 4-adic complexity of these quaternary sequences. Our results show that these sequences have large 4-adic complexity to resist the attack of the rational approximation algorithm.
Submitted 6 January, 2024;
originally announced January 2024.
-
AlignBench: Benchmarking Chinese Alignment of Large Language Models
Authors:
Xiao Liu,
Xuanyu Lei,
Shengyuan Wang,
Yue Huang,
Zhuoer Feng,
Bosi Wen,
Jiale Cheng,
Pei Ke,
Yifan Xu,
Weng Lam Tam,
Xiaohan Zhang,
Lichao Sun,
Hongning Wang,
Jing Zhang,
Minlie Huang,
Yuxiao Dong,
Jie Tang
Abstract:
Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. Equipped with a human-in-the-loop data curation pipeline, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. Furthermore, we report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability. We will provide public APIs for evaluating AlignBench with CritiqueLLM to facilitate the evaluation of LLMs' Chinese alignment. All evaluation code, data, and LLM generations are available at https://github.com/THUDM/AlignBench.
Submitted 5 December, 2023; v1 submitted 30 November, 2023;
originally announced November 2023.
-
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
Authors:
Pei Ke,
Bosi Wen,
Zhuoer Feng,
Xiao Liu,
Xuanyu Lei,
Jiale Cheng,
Shengyuan Wang,
Aohan Zeng,
Yuxiao Dong,
Hongning Wang,
Jie Tang,
Minlie Huang
Abstract:
Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4's direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
Submitted 26 June, 2024; v1 submitted 30 November, 2023;
originally announced November 2023.
-
Unveiling the Implicit Toxicity in Large Language Models
Authors:
Jiaxin Wen,
Pei Ke,
Hao Sun,
Zhexin Zhang,
Chengfei Li,
Jinfeng Bai,
Minlie Huang
Abstract:
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use. While recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that, via simple zero-shot prompting, LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect. Moreover, we propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs. Specifically, we optimize the language model with a reward that prefers implicit toxic outputs to explicit toxic and non-toxic ones. Experiments on five widely-adopted toxicity classifiers demonstrate that the attack success rate can be significantly improved through RL fine-tuning. For instance, the RL-finetuned LLaMA-13B model achieves an attack success rate of 90.04% on BAD and 62.85% on Davinci003. Our findings suggest that LLMs pose a significant threat in generating undetectable implicit toxic outputs. We further show that fine-tuning toxicity classifiers on the annotated examples from our attacking method can effectively enhance their ability to detect LLM-generated implicit toxic language. The code is publicly available at https://github.com/thu-coai/Implicit-Toxicity.
Submitted 29 November, 2023;
originally announced November 2023.
-
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
Authors:
Zhexin Zhang,
Junxiao Yang,
Pei Ke,
Fei Mi,
Hongning Wang,
Minlie Huang
Abstract:
While significant attention has been dedicated to exploiting weaknesses in LLMs through jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We point out a pivotal factor contributing to the success of jailbreaks: the intrinsic conflict between the goals of being helpful and ensuring safety. Accordingly, we propose to integrate goal prioritization at both training and inference stages to counteract this conflict. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking from 66.4% to 3.6% for ChatGPT. And integrating goal prioritization into model training reduces the ASR from 71.0% to 6.6% for Llama2-13B. Remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the ASR by half. Additionally, our findings reveal that while stronger LLMs face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks, both because of their stronger ability in instruction following. Our work thus contributes to the comprehension of jailbreaking attacks and defenses, and sheds light on the relationship between LLMs' capability and safety. Our code is available at https://github.com/thu-coai/JailbreakDefense_GoalPriority.
Submitted 12 June, 2024; v1 submitted 15 November, 2023;
originally announced November 2023.
-
Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
Authors:
Jiale Cheng,
Xiao Liu,
Kehan Zheng,
Pei Ke,
Hongning Wang,
Yuxiao Dong,
Jie Tang,
Minlie Huang
Abstract:
Large language models (LLMs) have shown impressive success in various applications. However, these models are often not well aligned with human intents, which calls for additional treatments on them; that is, the alignment problem. To make LLMs better follow user instructions, existing alignment methods primarily focus on further training them. However, the extra training of LLMs is usually expensive in terms of GPU computing; even worse, some LLMs are not accessible for user-demanded training, such as GPTs. In this work, we take a different perspective -- Black-Box Prompt Optimization (BPO) -- to perform alignments. The idea is to optimize user prompts to suit LLMs' input understanding, so as to best realize users' intents without updating LLMs' parameters. BPO leverages human preferences to optimize prompts, thus making it superior to LLM (e.g., ChatGPT) as a prompt engineer. Moreover, BPO is model-agnostic, and the empirical results demonstrate that the BPO-aligned ChatGPT yields a 22% increase in the win rate against its original version and 10% for GPT-4. Notably, the BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and it also brings additional performance gains when combining BPO with PPO or DPO. Code and datasets are released at https://github.com/thu-coai/BPO.
Submitted 21 June, 2024; v1 submitted 7 November, 2023;
originally announced November 2023.
-
Language Model Decoding as Direct Metrics Optimization
Authors:
Haozhe Ji,
Pei Ke,
Hongning Wang,
Minlie Huang
Abstract:
Despite the remarkable advances in language modeling, current mainstream decoding methods still struggle to generate texts that align with human texts across different aspects. In particular, sampling-based methods produce less-repetitive texts which are often disjunctive in discourse, while search-based methods maintain topic coherence at the cost of increased repetition. Overall, these methods fall short in achieving holistic alignment across a broad range of aspects. In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts measured by multiple metrics of desired aspects simultaneously. The resulting decoding distribution enjoys an analytical solution that scales the input language model distribution via a sequence-level energy function defined by these metrics. And most importantly, we prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts. To facilitate tractable sampling from this globally normalized distribution, we adopt the Sampling-Importance-Resampling technique. Experiments on various domains and model scales demonstrate the superiority of our method in metrics alignment with human texts and human evaluation over strong baselines.
Submitted 5 June, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering
Authors:
Pei Ke,
Fei Huang,
Fei Mi,
Yasheng Wang,
Qun Liu,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability. Specifically, most of the well-performing metrics are required to train on evaluation datasets of specific NLG tasks and evaluation dimensions, which may cause over-fitting to task-specific datasets. Furthermore, existing metrics only provide an evaluation score for each dimension without revealing the evidence to interpret how this score is obtained. To deal with these challenges, we propose a simple yet effective metric called DecompEval. This metric formulates NLG evaluation as an instruction-style question answering task and utilizes instruction-tuned pre-trained language models (PLMs) without training on evaluation datasets, aiming to enhance the generalization ability. To make the evaluation process more interpretable, we decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence. The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result. Experimental results show that DecompEval achieves state-of-the-art performance among untrained metrics for evaluating text summarization and dialogue generation, and also exhibits strong dimension-level / task-level generalization ability and interpretability.
Submitted 13 July, 2023;
originally announced July 2023.
-
Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning
Authors:
Chujie Zheng,
Pei Ke,
Zheng Zhang,
Minlie Huang
Abstract:
It has always been an important yet challenging problem to control language models to avoid generating texts with undesirable attributes, such as toxic language and unnatural repetition. We introduce Click for controllable text generation, which needs no modification to the model architecture and facilitates out-of-the-box use of trained models. It employs a contrastive loss on sequence likelihood, which fundamentally decreases the generation probability of negative samples (i.e., generations with undesirable attributes). It also adopts a novel likelihood ranking-based strategy to construct contrastive samples from model generations. On the tasks of language detoxification, sentiment steering, and repetition reduction, we show that Click outperforms strong baselines of controllable text generation and demonstrate the superiority of Click's sample construction strategy.
Submitted 5 June, 2023;
originally announced June 2023.
-
Directed Acyclic Transformer Pre-training for High-quality Non-autoregressive Text Generation
Authors:
Fei Huang,
Pei Ke,
Minlie Huang
Abstract:
Non-AutoRegressive (NAR) text generation models have drawn much attention because of their significantly faster decoding speed and good generation quality in machine translation. However, in a wider range of text generation tasks, existing NAR models lack proper pre-training, making them still far behind the pre-trained autoregressive models. In this paper, we propose Pre-trained Directed Acyclic Transformer (PreDAT) and a novel pre-training task to promote prediction consistency in NAR generation. Experiments on five text generation tasks show that our PreDAT remarkably outperforms existing pre-trained NAR models (+4.2 scores on average) and even achieves better results than pre-trained autoregressive baselines in n-gram-based metrics, along with 17 times speedup in throughput. Further analysis shows that PreDAT benefits from the unbiased prediction order that alleviates the error accumulation problem in autoregressive generation, which provides new insights into the advantages of NAR generation.
Submitted 23 April, 2023;
originally announced April 2023.
-
Tailoring Language Generation Models under Total Variation Distance
Authors:
Haozhe Ji,
Pei Ke,
Zhipeng Hu,
Rongsheng Zhang,
Minlie Huang
Abstract:
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method. From a distributional view, MLE in fact minimizes the Kullback-Leibler divergence (KLD) between the distribution of the real data and that of the model. However, this approach forces the model to distribute non-zero (sometimes large) probability mass to all training samples regardless of their quality. Moreover, in the attempt to cover the low-probability regions in the data distribution, the model systematically overestimates the probability of corrupted text sequences, which we conjecture is one of the main reasons for text degeneration during autoregressive decoding. To remedy this problem, we leverage the total variation distance (TVD) with its robustness to outliers, and develop practical bounds to apply it to language generation. Then, we introduce the TaiLr objective that balances the tradeoff of estimating TVD. Intuitively, TaiLr downweights real data samples that have low model probabilities with tunable penalization intensity. Experimental results show that our method alleviates the overestimation of degenerated sequences without sacrificing diversity and improves generation quality on a wide range of text generation tasks.
Submitted 26 February, 2023;
originally announced February 2023.
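The contrast the abstract draws between KLD and TVD can be illustrated on a toy distribution. The sketch below is not the TaiLr objective. It only compares the two divergences when a model assigns vanishing probability to a low-quality outcome that appears in the data. The distributions and the helper name `model_with_outlier_mass` are hypothetical, chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """TVD(p, q) = 0.5 * sum |p_i - q_i|; always lies in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical data distribution: 5% of the mass sits on a "corrupted" outcome.
data = [0.60, 0.35, 0.05]

def model_with_outlier_mass(eps):
    """Hypothetical model that gives only `eps` probability to the corrupted outcome."""
    return [0.62, 0.38 - eps, eps]

for eps in (1e-2, 1e-4, 1e-8):
    model = model_with_outlier_mass(eps)
    print(f"eps={eps:g}  KL={kl_divergence(data, model):.3f}  "
          f"TVD={total_variation(data, model):.3f}")
```

As `eps` shrinks, the KL term for the corrupted outcome grows without bound, so an MLE-trained model is pushed to keep mass on it. TVD stays bounded, which is the robustness to outliers that the abstract refers to.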
-
Technical Report: Automating Vehicle SOA Threat Analysis using a Model-Based Methodology
Authors:
Yuri Gil Dantas,
Simon Barner,
Pei Ke,
Vivek Nigam,
Ulrich Schoepp
Abstract:
While the adoption of Service-Oriented Architectures (SOA) eases the implementation of features such as autonomous driving and over-the-air updates, it also increases the vehicle's exposure to attacks that may put road users in harm's way. To address this problem, standards (ISO 21434/UNECE) expect manufacturers to produce security arguments and evidence by carrying out appropriate threat analysis. As key threat analysis steps, e.g., damage/threat scenario and attack path enumeration, are often carried out manually and not rigorously, security arguments lack precise guarantees, e.g., traceability w.r.t. safety goals, especially under system updates. This article proposes automated methods for threat analysis using a model-based engineering methodology that provides precise guarantees with respect to safety goals. This is accomplished by proposing an intruder model for automotive SOA which, together with the system architecture and the loss scenarios identified by safety analysis, is used as input for computing assets, impact rating, damage/threat scenarios, and attack paths. To validate the proposed methodology, we developed a faithful model of the autonomous driving functions of the Apollo framework, a widely used open-source autonomous driving stack. The proposed machinery automatically enumerates several attack paths on Apollo, including attack paths not reported in the literature.
Submitted 23 December, 2022;
originally announced December 2022.
-
Learning Instructions with Unlabeled Data for Zero-Shot Cross-Task Generalization
Authors:
Yuxian Gu,
Pei Ke,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Training language models to learn from human instructions for zero-shot cross-task generalization has attracted much attention in NLP communities. Recently, instruction tuning (IT), which fine-tunes a pre-trained language model on a massive collection of tasks described via human-crafted instructions, has been shown effective in instruction learning for unseen tasks. However, IT relies on a large amount of human-annotated samples, which restricts its generalization. Unlike labeled data, unlabeled data are often massive and cheap to obtain. In this work, we study how IT can be improved with unlabeled data. We first empirically explore the IT performance trends versus the amount of labeled data, the number of instructions, and the number of training tasks. We find it critical to enlarge the number of training instructions, and the instructions can be underutilized due to the scarcity of labeled data. Then, we propose Unlabeled Data Augmented Instruction Tuning (UDIT) to take better advantage of the instructions during IT by constructing pseudo-labeled data from unlabeled plain texts. We conduct extensive experiments to show UDIT's effectiveness in various scenarios of tasks and datasets. We also comprehensively analyze the key factors of UDIT to investigate how to better improve IT with unlabeled data. The code is publicly available at https://github.com/thu-coai/UDIT.
Submitted 17 October, 2022;
originally announced October 2022.
-
Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation
Authors:
Pei Ke,
Haozhe Ji,
Zhenyu Yang,
Yi Huang,
Junlan Feng,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Despite the success of text-to-text pre-trained models in various natural language generation (NLG) tasks, the generation performance is largely restricted by the amount of labeled data in downstream tasks, particularly in data-to-text generation tasks. Existing works mostly utilize abundant unlabeled structured data to conduct unsupervised pre-training for task adaptation, which fails to model the complex relationship between source structured data and target texts. Thus, we introduce self-training as a better few-shot learner than task-adaptive pre-training, which explicitly captures this relationship via pseudo-labeled data generated by the pre-trained model. To alleviate the side effect of low-quality pseudo-labeled data during self-training, we propose a novel method called Curriculum-Based Self-Training (CBST) to effectively leverage unlabeled data in a rearranged order determined by the difficulty of text generation. Experimental results show that our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
Submitted 6 June, 2022;
originally announced June 2022.
-
CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation
Authors:
Pei Ke,
Hao Zhou,
Yankai Lin,
Peng Li,
Jie Zhou,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Existing reference-free metrics have obvious limitations for evaluating controlled text generation models. Unsupervised metrics can only provide a task-agnostic evaluation result which correlates weakly with human judgments, whereas supervised ones may overfit task-specific data with poor generalization ability to other datasets. In this paper, we propose an unsupervised reference-free metric called CTRLEval, which evaluates controlled text generation from different aspects by formulating each aspect into multiple text infilling tasks. On top of these tasks, the metric assembles the generation probabilities from a pre-trained language model without any model training. Experimental results show that our metric has higher correlations with human judgments than other baselines, while obtaining better generalization of evaluating generated texts from different models and with different qualities.
Submitted 5 December, 2022; v1 submitted 2 April, 2022;
originally announced April 2022.
-
EVA2.0: Investigating Open-Domain Chinese Dialogue Systems with Large-Scale Pre-Training
Authors:
Yuxian Gu,
Jiaxin Wen,
Hao Sun,
Yi Song,
Pei Ke,
Chujie Zheng,
Zheng Zhang,
Jianzhu Yao,
Lei Liu,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Large-scale pre-training has shown remarkable performance in building open-domain dialogue systems. However, previous works mainly focus on showing and evaluating the conversational performance of the released dialogue model, ignoring the discussion of some key factors towards a powerful human-like chatbot, especially in Chinese scenarios. In this paper, we conduct extensive experiments to investigate these under-explored factors, including data quality control, model architecture designs, training approaches, and decoding strategies. We propose EVA2.0, a large-scale pre-trained open-domain Chinese dialogue model with 2.8 billion parameters, and will make our models and code publicly available. Automatic and human evaluations show that EVA2.0 significantly outperforms other open-source counterparts. We also discuss the limitations of this work by presenting some failure cases and pose some future research directions on large-scale Chinese open-domain dialogue systems.
Submitted 21 October, 2023; v1 submitted 17 March, 2022;
originally announced March 2022.
-
Rethinking and Refining the Distinct Metric
Authors:
Siyang Liu,
Sahand Sabour,
Yinhe Zheng,
Pei Ke,
Xiaoyan Zhu,
Minlie Huang
Abstract:
The Distinct-$n$ score (Li et al., 2016) is a widely used automatic metric for evaluating diversity in language generation tasks. However, we observed that the original approach for calculating distinct scores has evident biases that tend to assign higher penalties to longer sequences. We refine the calculation of distinct scores by scaling the number of distinct tokens based on their expectations. We provide both empirical and theoretical evidence to show that our method effectively removes the biases existing in the original distinct score. Our experiments show that our proposed metric, Expectation-Adjusted Distinct (EAD), correlates better with human judgment in evaluating response diversity. To foster future research, we provide an example implementation at https://github.com/lsy641/Expectation-Adjusted-Distinct.
Submitted 3 April, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
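The length bias described in the abstract can be reproduced with a short sketch. It computes the plain Distinct-$n$ ratio and then rescales the distinct-token count by its expected value under an assumed model. The assumption here is uniform sampling from a vocabulary of size `vocab_size`, which is mine for illustration. The authors' implementation in the linked repository may differ in details, and the function names below are hypothetical.

```python
import random

def distinct_n(tokens, n=1):
    """Plain Distinct-n: number of unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

def expected_distinct(total, vocab_size):
    """Expected number of distinct tokens after `total` uniform draws
    from a vocabulary of `vocab_size` tokens (illustrative assumption)."""
    return vocab_size * (1.0 - ((vocab_size - 1) / vocab_size) ** total)

def expectation_adjusted_distinct(tokens, vocab_size):
    """Distinct-token count scaled by its expectation instead of by length."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / expected_distinct(len(tokens), vocab_size)

# Two equally "diverse" samples (uniform over the vocabulary), different lengths.
rng = random.Random(0)
vocab_size = 1000
short = [rng.randrange(vocab_size) for _ in range(20)]
long_ = [rng.randrange(vocab_size) for _ in range(2000)]

for name, sample in (("short", short), ("long", long_)):
    print(name,
          round(distinct_n(sample), 3),
          round(expectation_adjusted_distinct(sample, vocab_size), 3))
```

Both samples come from the same distribution, yet the plain ratio drops sharply for the longer one, while the expectation-scaled score stays close to 1 for both.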
-
EVA: An Open-Domain Chinese Dialogue System with Large-Scale Generative Pre-Training
Authors:
Hao Zhou,
Pei Ke,
Zheng Zhang,
Yuxian Gu,
Yinhe Zheng,
Chujie Zheng,
Yida Wang,
Chen Henry Wu,
Hao Sun,
Xiaocong Yang,
Bosi Wen,
Xiaoyan Zhu,
Minlie Huang,
Jie Tang
Abstract:
Although pre-trained language models have remarkably enhanced the generation ability of dialogue systems, open-domain Chinese dialogue systems are still limited by the dialogue data and the model size compared with English ones. In this paper, we propose EVA, a Chinese dialogue system that contains the largest Chinese pre-trained dialogue model with 2.8B parameters. To build this model, we collect the largest Chinese dialogue dataset named WDC-Dialogue from various public social media. This dataset contains 1.4B context-response pairs and is used as the pre-training corpus of EVA. Extensive experiments on automatic and human evaluation show that EVA outperforms other Chinese pre-trained dialogue models especially in the multi-turn interaction of human-bot conversations.
Submitted 3 August, 2021;
originally announced August 2021.
-
CPM-2: Large-scale Cost-effective Pre-trained Language Models
Authors:
Zhengyan Zhang,
Yuxian Gu,
Xu Han,
Shengqi Chen,
Chaojun Xiao,
Zhenbo Sun,
Yuan Yao,
Fanchao Qi,
Jian Guan,
Pei Ke,
Yanzheng Cai,
Guoyang Zeng,
Zhixing Tan,
Zhiyuan Liu,
Minlie Huang,
Wentao Han,
Yang Liu,
Xiaoyan Zhu,
Maosong Sun
Abstract:
In recent years, the size of pre-trained language models (PLMs) has grown by leaps and bounds. However, efficiency issues of these large-scale PLMs limit their utilization in real-world scenarios. We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. (1) We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. (2) We explore the best practice of prompt tuning with large-scale PLMs. Compared with conventional fine-tuning, prompt tuning significantly reduces the number of task-specific parameters. (3) We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources. Based on our cost-effective pipeline, we pre-train two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In our experiments, we compare CPM-2 with mT5 on downstream tasks. Experimental results show that CPM-2 has excellent general language intelligence. Moreover, we validate the efficiency of InfMoE when conducting inference of large-scale models having tens of billions of parameters on a single GPU. All source code and model parameters are available at https://github.com/TsinghuaAI/CPM.
Submitted 24 June, 2021; v1 submitted 20 June, 2021;
originally announced June 2021.
-
JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs
Authors:
Pei Ke,
Haozhe Ji,
Yu Ran,
Xin Cui,
Liwei Wang,
Linfeng Song,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Existing pre-trained models for knowledge-graph-to-text (KG-to-text) generation simply fine-tune text-to-text pre-trained models such as BART or T5 on KG-to-text datasets, which largely ignore the graph structure during encoding and lack elaborate pre-training tasks to explicitly model graph-text alignments. To tackle these problems, we propose a graph-text joint representation learning model called JointGT. During encoding, we devise a structure-aware semantic aggregation module which is plugged into each Transformer layer to preserve the graph structure. Furthermore, we propose three new pre-training tasks to explicitly enhance the graph-text alignment including respective text / graph reconstruction, and graph-text alignment in the embedding space via Optimal Transport. Experiments show that JointGT obtains new state-of-the-art performance on various KG-to-text datasets.
Submitted 19 June, 2021;
originally announced June 2021.
-
Semantic-Enhanced Explainable Finetuning for Open-Domain Dialogues
Authors:
Yinhe Zheng,
Yida Wang,
Pei Ke,
Zhenyu Yang,
Minlie Huang
Abstract:
This paper proposes to combine pretrained language models with the modular dialogue paradigm for open-domain dialogue modeling. Our method, semantic-enhanced finetuning, instantiates conversation understanding, planning, and response generation as a language model finetuning task. At inference, we disentangle semantic and token variations by specifying sampling methods and constraints for each module separately. For training and evaluation, we present X-Weibo, a Chinese multi-turn open-domain dialogue dataset with automatic annotation for emotions, dialogue acts (DAs), and topical words. Experiments show that semantic-enhanced finetuning outperforms strong baselines on non-semantic and semantic metrics, improves the human-evaluated relevance, coherence, and informativeness, and exhibits considerable controllability over semantic variables.
Submitted 23 May, 2022; v1 submitted 6 June, 2021;
originally announced June 2021.
-
CPM: A Large-scale Generative Chinese Pre-trained Language Model
Authors:
Zhengyan Zhang,
Xu Han,
Hao Zhou,
Pei Ke,
Yuxian Gu,
Deming Ye,
Yujia Qin,
Yusheng Su,
Haozhe Ji,
Jian Guan,
Fanchao Qi,
Xiaozhi Wang,
Yanan Zheng,
Guoyang Zeng,
Huanqi Cao,
Shengqi Chen,
Daixuan Li,
Zhenbo Sun,
Zhiyuan Liu,
Minlie Huang,
Wentao Han,
Jie Tang,
Juanzi Li,
Xiaoyan Zhu,
Maosong Sun
Abstract:
Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning. The code and parameters are available at https://github.com/TsinghuaAI/CPM-Generate.
Submitted 1 December, 2020;
originally announced December 2020.
-
Estimates of daily ground-level NO2 concentrations in China based on big data and machine learning approaches
Authors:
Xinyu Dou,
Cuijuan Liao,
Hengqi Wang,
Ying Huang,
Ying Tu,
Xiaomeng Huang,
Yiran Peng,
Biqing Zhu,
Jianguang Tan,
Zhu Deng,
Nana Wu,
Taochun Sun,
Piyu Ke,
Zhu Liu
Abstract:
Nitrogen dioxide (NO2) is one of the most important atmospheric pollutants. However, current ground-level NO2 concentration data lack either high-resolution coverage or full nationwide coverage, due to the poor quality of source data and the limited computing power of the models. To our knowledge, this study is the first to estimate the ground-level NO2 concentration in China with national coverage as well as relatively high spatiotemporal resolution (0.25 degree; daily intervals) over a recent 6-year period (2013-2018). We developed a Random Forest model integrated with K-means (RF-K) for the estimates with multi-source parameters. Besides meteorological parameters and satellite retrieval parameters, we also, for the first time, introduce socio-economic parameters to assess the impact of human activities. The results show that: (1) the RF-K model we developed shows better prediction performance than other models, with cross-validation R2 = 0.64 (MAPE = 34.78%). (2) The annual average concentration of NO2 in China showed a weak increasing trend, while in economic zones such as the Beijing-Tianjin-Hebei region, the Yangtze River Delta, and the Pearl River Delta, the NO2 concentration decreased or remained unchanged, especially in spring. Our dataset verifies that pollutant control targets have been achieved in these areas. By mapping daily nationwide ground-level NO2 concentrations, this study provides timely, high-quality data for air quality management in China. We provide a universal model framework to quickly generate a timely national atmospheric pollutant concentration map with high spatiotemporal resolution, based on improved machine learning methods.
Submitted 17 November, 2020;
originally announced November 2020.
-
Generating Commonsense Explanation by Extracting Bridge Concepts from Reasoning Paths
Authors:
Haozhe Ji,
Pei Ke,
Shaohan Huang,
Furu Wei,
Minlie Huang
Abstract:
Commonsense explanation generation aims to empower the machine's sense-making capability by generating plausible explanations for statements that go against commonsense. While this task is easy for humans, machines still struggle to generate reasonable and informative explanations. In this work, we propose a method that first extracts the underlying concepts that serve as bridges in the reasoning chain and then integrates these concepts to generate the final explanation. To facilitate the reasoning process, we utilize external commonsense knowledge to build the connection between a statement and the bridge concepts by extracting and pruning multi-hop paths to build a subgraph. We design a bridge concept extraction model that first scores the triples, routes the paths in the subgraph, and further selects bridge concepts with weak supervision at both the triple level and the concept level. We conduct experiments on the commonsense explanation generation task and our model outperforms the state-of-the-art baselines in both automatic and human evaluation.
Submitted 24 September, 2020;
originally announced September 2020.
-
Language Generation with Multi-Hop Reasoning on Commonsense Knowledge Graph
Authors:
Haozhe Ji,
Pei Ke,
Shaohan Huang,
Furu Wei,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Despite the success of generative pre-trained language models on a series of text generation tasks, they still suffer in cases where reasoning over underlying commonsense knowledge is required during generation. Existing approaches that integrate commonsense knowledge into generative pre-trained language models simply transfer relational knowledge by post-training on individual knowledge triples while ignoring rich connections within the knowledge graph. We argue that exploiting both the structural and semantic information of the knowledge graph facilitates commonsense-aware text generation. In this paper, we propose Generation with Multi-Hop Reasoning Flow (GRF) that enables pre-trained models with dynamic multi-hop reasoning on multi-relational paths extracted from the external commonsense knowledge graph. We empirically show that our model outperforms existing baselines on three text generation tasks that require reasoning over commonsense knowledge. We also demonstrate the effectiveness of the dynamic multi-hop reasoning module with reasoning paths inferred by the model that provide rationale to the generation.
Submitted 24 September, 2020;
originally announced September 2020.
-
A Large-Scale Chinese Short-Text Conversation Dataset
Authors:
Yida Wang,
Pei Ke,
Yinhe Zheng,
Kaili Huang,
Yong Jiang,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Advances in neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually needs a large-scale high-quality dialogue corpus, which is hard to access. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models that are trained on LCCC-base and LCCC-large respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All the models and datasets are available at https://github.com/thu-coai/CDial-GPT.
Submitted 26 April, 2022; v1 submitted 10 August, 2020;
originally announced August 2020.
-
CoTK: An Open-Source Toolkit for Fast Development and Fair Evaluation of Text Generation
Authors:
Fei Huang,
Dazhen Wan,
Zhihong Shao,
Pei Ke,
Jian Guan,
Yilin Niu,
Xiaoyan Zhu,
Minlie Huang
Abstract:
In text generation evaluation, many practical issues, such as inconsistent experimental settings and metric implementations, are often ignored but lead to unfair evaluation and untenable conclusions. We present CoTK, an open-source toolkit aiming to support fast development and fair evaluation of text generation. In model development, CoTK helps handle the cumbersome issues, such as data processing, metric implementation, and reproduction. It standardizes the development steps and reduces human errors which may lead to inconsistent experimental settings. In model evaluation, CoTK provides implementation for many commonly used metrics and benchmark models across different experimental settings. As a unique feature, CoTK can signify when and which metric cannot be fairly compared. We demonstrate that it is convenient to use CoTK for model development and evaluation, particularly across different experimental settings.
Submitted 3 February, 2020;
originally announced February 2020.
-
New Successor Rules to Efficiently Produce Exponentially Many Binary de Bruijn Sequences
Authors:
Zuling Chang,
Martianus Frederic Ezerman,
Pinhui Ke,
Qiang Wang
Abstract:
We put forward new general criteria to design successor rules that generate binary de Bruijn sequences. Prior fast algorithms based on successor rules in the literature are then shown to be special instances. We implemented the criteria to join the cycles generated by a number of simple feedback shift registers (FSRs) of order $n$. These include the pure cycling register (PCR) and the pure summing register (PSR). For the PCR, we define a transitive relation on its cycles, based on their weights. We also extend the choices of conjugate states by using shift operations. For the PSR, we define three distinct transitive relations on its cycles, namely a run order, a necklace order, and a mixed order. Using the new orders, we propose numerous classes of successor rules. Each class efficiently generates a number, exponential in $n$, of binary de Bruijn sequences. Producing the next bit in each such sequence takes $O(n)$ memory and $O(n)$ time. We implemented computational routines to confirm the claims.
Submitted 5 July, 2021; v1 submitted 15 November, 2019;
originally announced November 2019.
-
SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge
Authors:
Pei Ke,
Haozhe Ji,
Siyang Liu,
Xiaoyan Zhu,
Minlie Huang
Abstract:
Most of the existing pre-trained language representation models neglect to consider the linguistic knowledge of texts, which can promote language understanding in NLP tasks. To benefit the downstream tasks in sentiment analysis, we propose a novel language representation model called SentiLARE, which introduces word-level linguistic knowledge including part-of-speech tag and sentiment polarity (inferred from SentiWordNet) into pre-trained models. We first propose a context-aware sentiment attention mechanism to acquire the sentiment polarity of each word with its part-of-speech tag by querying SentiWordNet. Then, we devise a new pre-training task called label-aware masked language model to construct knowledge-aware language representation. Experiments show that SentiLARE obtains new state-of-the-art performance on a variety of sentiment analysis tasks.
Submitted 24 September, 2020; v1 submitted 6 November, 2019;
originally announced November 2019.
-
ARAML: A Stable Adversarial Training Framework for Text Generation
Authors:
Pei Ke,
Fei Huang,
Minlie Huang,
Xiaoyan Zhu
Abstract:
Most of the existing generative adversarial networks (GAN) for text generation suffer from the instability of reinforcement learning training algorithms such as policy gradient, leading to unstable performance. To tackle this problem, we propose a novel framework called Adversarial Reward Augmented Maximum Likelihood (ARAML). During adversarial training, the discriminator assigns rewards to samples which are acquired from a stationary distribution near the data rather than the generator's distribution. The generator is optimized with maximum likelihood estimation augmented by the discriminator's rewards instead of policy gradient. Experiments show that our model can outperform state-of-the-art text GANs with a more stable training process.
Submitted 20 August, 2019;
originally announced August 2019.
-
On $k$-error linear complexity of pseudorandom binary sequences derived from Euler quotients
Authors:
Zhixiong Chen,
Vladimir Edemskiy,
Pinhui Ke,
Chenhuang Wu
Abstract:
We investigate the $k$-error linear complexity of pseudorandom binary sequences of period $p^{\mathfrak{r}}$ derived from the Euler quotients modulo $p^{\mathfrak{r}-1}$, a power of an odd prime $p$ for $\mathfrak{r}\geq 2$. When $\mathfrak{r}=2$, this is just the case of polynomial quotients (including Fermat quotients) modulo $p$, which has been studied in an earlier work of Chen, Niu and Wu. In this work, we establish a recursive relation on the $k$-error linear complexity of the sequences for the case of $\mathfrak{r}\geq 3$. We also state the exact values of the $k$-error linear complexity for the case of $\mathfrak{r}=3$. From the results, we can find that the $k$-error linear complexity of the sequences (of period $p^{\mathfrak{r}}$) does not decrease dramatically for $k<p^{\mathfrak{r}-2}(p-1)^2/2$.
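For readers unfamiliar with the quantity being studied, the sketch below computes the linear complexity of a binary sequence with the Berlekamp-Massey algorithm over $\mathbb{F}_2$, and then the $k$-error linear complexity of a periodic sequence by brute force: the minimum linear complexity over all sequences obtained by changing at most $k$ bits per period. This exhaustive search is only feasible for very short periods and is not the method of the paper, which derives its results analytically from Euler quotients. The function names and the example sequence are illustrative assumptions.

```python
from itertools import combinations


def linear_complexity(s):
    """Berlekamp-Massey over GF(2): length of the shortest LFSR generating s."""
    n = len(s)
    c = [0] * (n + 1)  # current connection polynomial
    b = [0] * (n + 1)  # polynomial saved at the last length change
    c[0] = b[0] = 1
    L, m = 0, -1
    for i in range(n):
        # discrepancy between s[i] and the current LFSR's prediction
        d = s[i]
        for j in range(1, L + 1):
            d ^= c[j] & s[i - j]
        if d:
            t = c[:]
            shift = i - m
            for j in range(n + 1 - shift):
                c[j + shift] ^= b[j]
            if 2 * L <= i:
                L, m, b = i + 1 - L, i, t
    return L


def k_error_linear_complexity(period, k):
    """Brute-force k-error linear complexity of a periodic binary sequence.

    `period` is one period as a list of 0/1 values. Two periods are passed to
    Berlekamp-Massey, which suffices because the complexity is at most len(period).
    """
    n = len(period)
    best = linear_complexity(period * 2)
    for e in range(1, k + 1):
        for positions in combinations(range(n), e):
            flipped = list(period)
            for p in positions:
                flipped[p] ^= 1
            best = min(best, linear_complexity(flipped * 2))
    return best


# Example: one period of a length-7 m-sequence (s[i+3] = s[i+2] XOR s[i]).
s = [1, 1, 1, 0, 1, 0, 0]
profile = [k_error_linear_complexity(s, k) for k in range(sum(s) + 1)]
print(profile)
```

The resulting list is non-increasing in $k$ and reaches 0 once $k$ equals the number of ones in a period, since all of them can then be flipped to zero. A sequence is considered stable when this profile stays high for small $k$.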
△ Less
Submitted 15 March, 2018; v1 submitted 8 March, 2018;
originally announced March 2018.
-
A further study on the linear complexity of new binary cyclotomic sequence of length $p^r$
Authors:
Zhifan Ye,
Pinhui Ke,
Chenhuang Wu
Abstract:
Recently, a conjecture on the linear complexity of a new class of generalized cyclotomic binary sequences of period $p^r$ was proposed by Z. Xiao et al. (Des. Codes Cryptogr., DOI 10.1007/s10623-017-0408-7). Later, Vladimir Edemskiy proved the conjecture for the case where $f$ is of the form $2^r$ with $r\ge 1$ (arXiv:1712.03947). In this paper, under the assumptions $2^{p-1} \not\equiv 1 \bmod p^2$ and $\gcd(\frac{p-1}{\rm {ord}_{p}(2)},f)=1$, the conjecture is proved for general $f$ by using the Euler quotient. More precisely, we introduce a generic construction of $p^r$-periodic binary sequences based on generalized cyclotomy, which admits a flexible support set and includes Xiao's construction as a special case. We then present an efficient method to compute the linear complexity of sequences obtained from the generic construction, from which the conjecture of Z. Xiao et al. follows easily under the aforementioned assumptions.
△ Less
Submitted 15 March, 2018; v1 submitted 24 December, 2017;
originally announced December 2017.
-
On error linear complexity of new generalized cyclotomic binary sequences of period $p^2$
Authors:
Chenhuang Wu,
Chunxiang Xu,
Zhixiong Chen,
Pinhui Ke
Abstract:
We consider the $k$-error linear complexity of a new binary sequence of period $p^2$, proposed in the recent paper "New generalized cyclotomic binary sequences of period $p^2$" by Z. Xiao et al., who calculated the linear complexity of the sequences (Designs, Codes and Cryptography, 2017, https://doi.org/10.1007/s10623-017-0408-7). More precisely, we determine the values of the $k$-error linear complexity over $\mathbb{F}_2$ for almost all $k>0$ using the theory of Fermat quotients. The results indicate that such sequences have good stability.
△ Less
Submitted 22 April, 2018; v1 submitted 16 November, 2017;
originally announced November 2017.