Search | arXiv e-print repository

UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation

Authors: Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, Qi Zhang

Abstract: Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifical… ▽ More Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps. △ Less

Submitted 16 December, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: EMNLP 2023 Main Conference

arXiv:2303.07678 [pdf, other]

Query2doc: Query Expansion with Large Language Models

Authors: Liang Wang, Nan Yang, Furu Wei

Abstract: This paper introduces a simple yet effective query expansion approach, denoted as query2doc, to improve both sparse and dense retrieval systems. The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with generated pseudo-documents. LLMs are trained on web-scale text corpora and are adept at knowledge memorization. The ps… ▽ More This paper introduces a simple yet effective query expansion approach, denoted as query2doc, to improve both sparse and dense retrieval systems. The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with generated pseudo-documents. LLMs are trained on web-scale text corpora and are adept at knowledge memorization. The pseudo-documents from LLMs often contain highly relevant information that can aid in query disambiguation and guide the retrievers. Experimental results demonstrate that query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning. Furthermore, our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results. △ Less

Submitted 11 October, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: Accepted to EMNLP 2023

arXiv:2303.06604 [pdf, other]

doi 10.1103/PhysRevA.107.062411

Robust phase metrology with hybrid quantum interferometers against particle losses

Authors: X. N. Feng, D. He, L. F. Wei

Abstract: Entanglement is an important quantum resource to achieve high sensitive quantum metrology. However, the rapid decoherence of quantum entangled states, due to the unavoidable environment noise, result in practically the unwanted sharp drop of the measurement sensitivity. To overcome such a difficulty, here we propose a spin-oscillator hybrid quantum interferometer to achieve the desirable precise e… ▽ More Entanglement is an important quantum resource to achieve high sensitive quantum metrology. However, the rapid decoherence of quantum entangled states, due to the unavoidable environment noise, result in practically the unwanted sharp drop of the measurement sensitivity. To overcome such a difficulty, here we propose a spin-oscillator hybrid quantum interferometer to achieve the desirable precise estimation of the parameter encoded in the vibrations of the oscillator. Differing from the conventional two-mode quantum interferometers input by the two-mode NOON state or entangled coherent states (ECS), whose achievable sensitivities are strongly limited by the decoherence of the entangled vibrational states, we demonstrate that the present interferometer, input by a spin-dependent two-mode entangled state, possesses a manifest advantage, i.e., the measurement sensitivity of the estimated parameter is not influenced by the decoherence from the spin-oscillator entanglement. This is because that, by applying a spin-oscillator disentangled operation, the information of the estimated parameter encoded originally in the vibrational degrees can be effectively transferred into the spin degree and then can be sensitively estimated by the precise spin-state population measurements. As consequence, the proposed hybrid quantum interferometer possesses a manifest robustness against the particle losses of the vibrational modes. Interestingly, the achieved phase measurement sensitivity can still surpass the SQL obviously, even if relatively large number of particle loss occurs in one of the two modes. The potential application of the proposed spin-oscillator hybrid quantum interferometer is also discussed. △ Less

Submitted 12 March, 2023; originally announced March 2023.

arXiv:2303.03926 [pdf, other]

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Authors: Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

Abstract: We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilitie… ▽ More We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at \url{https://aka.ms/vallex}. △ Less

Submitted 7 March, 2023; originally announced March 2023.

Comments: We encourage readers to listen to the audio samples on our demo page: \url{https://aka.ms/vallex}

arXiv:2303.01421 [pdf, other]

Semiparametric Language Models Are Scalable Continual Learners

Authors: Guangyue Peng, Tao Ge, Si-Qing Chen, Furu Wei, Houfeng Wang

Abstract: Semiparametric language models (LMs) have shown promise in continuously learning from new text data by combining a parameterized neural LM with a growable non-parametric memory for memorizing new content. However, conventional semiparametric LMs will finally become prohibitive for computing and storing if they are applied to continual learning over streaming data, because the non-parametric memory… ▽ More Semiparametric language models (LMs) have shown promise in continuously learning from new text data by combining a parameterized neural LM with a growable non-parametric memory for memorizing new content. However, conventional semiparametric LMs will finally become prohibitive for computing and storing if they are applied to continual learning over streaming data, because the non-parametric memory grows linearly with the amount of data they learn from over time. To address the issue of scalability, we present a simple and intuitive approach called Selective Memorization (SeMem), which only memorizes difficult samples that the model is likely to struggle with. We demonstrate that SeMem improves the scalability of semiparametric LMs for continual learning over streaming data in two ways: (1) data-wise scalability: as the model becomes stronger through continual learning, it will encounter fewer difficult cases that need to be memorized, causing the growth of the non-parametric memory to slow down over time rather than growing at a linear rate with the size of training data; (2) model-wise scalability: SeMem allows a larger model to memorize fewer samples than its smaller counterpart because it is rarer for a larger model to encounter incomprehensible cases, resulting in a non-parametric memory that does not scale linearly with model size. We conduct extensive experiments in language modeling and downstream tasks to test SeMem's results, showing SeMem enables a semiparametric LM to be a scalable continual learner with little forgetting. △ Less

Submitted 2 March, 2023; originally announced March 2023.

Comments: Work in progress

arXiv:2303.00579 [pdf, other]

Are More Layers Beneficial to Graph Transformers?

Authors: Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, Furu Wei

Abstract: Despite that going deep has proven successful in many neural architectures, the existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from the bottleneck of improving performance by increasing depth. Our further analysis reveals the reason is that deep graph transformers… ▽ More Despite that going deep has proven successful in many neural architectures, the existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from the bottleneck of improving performance by increasing depth. Our further analysis reveals the reason is that deep graph transformers are limited by the vanishing capacity of global attention, restricting the graph transformer from focusing on the critical substructure and obtaining expressive features. To this end, we propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation, and applies local attention on related nodes to obtain substructure based attention encoding. Our model enhances the ability of the global attention to focus on substructures and promotes the expressiveness of the representations, addressing the limitation of self-attention as the graph transformer deepens. Experiments show that our method unblocks the depth limitation of graph transformers and results in state-of-the-art performance across various graph benchmarks with deeper models. △ Less

Submitted 1 March, 2023; originally announced March 2023.

Comments: ICLR 2023

arXiv:2302.14771 [pdf, other]

Generic-to-Specific Distillation of Masked Autoencoders

Authors: Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Jianbin Jiao, Qixiang Ye

Abstract: Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms achieved unprecedented progress. Lightweight ViT models limited by the model capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, the conventional single-stage distilla… ▽ More Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms achieved unprecedented progress. Lightweight ViT models limited by the model capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, the conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD), to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, decoder of the small model is encouraged to align feature predictions with hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are constrained to be consistent with those of the large model, to transfer task-specific features which guarantee task performance. With G2SD, the vanilla ViT-Small model respectively achieves 98.7%, 98.1% and 99.3% the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD. △ Less

Submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted by CVPR2023

arXiv:2302.14045 [pdf, other]

Language Is Not All You Need: Aligning Perception with Language Models

Authors: Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, Furu Wei

Abstract: A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal co… ▽ More A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs. △ Less

Submitted 1 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

arXiv:2302.12242 [pdf, other]

Side Adapter Network for Open-Vocabulary Semantic Segmentation

Authors: Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, Xiang Bai

Abstract: This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is app… ▽ More This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is applied in the CLIP model to recognize the class of masks. This decoupled design has the benefit CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation. The code will be available at https://github.com/MendelXu/SAN. △ Less

Submitted 22 March, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: CVPR2023 Highlight

arXiv:2301.06825 [pdf, other]

doi 10.1007/978-3-031-30675-4_34

HanoiT: Enhancing Context-aware Translation via Selective Context

Authors: Jian Yang, Yuwei Yin, Shuming Ma, Liqun Yang, Hongcheng Guo, Haoyang Huang, Dongdong Zhang, Yutao Zeng, Zhoujun Li, Furu Wei

Abstract: Context-aware neural machine translation aims to use the document-level context to improve translation quality. However, not all words in the context are helpful. The irrelevant or trivial words may bring some noise and distract the model from learning the relationship between the current sentence and the auxiliary context. To mitigate this problem, we propose a novel end-to-end encoder-decoder mo… ▽ More Context-aware neural machine translation aims to use the document-level context to improve translation quality. However, not all words in the context are helpful. The irrelevant or trivial words may bring some noise and distract the model from learning the relationship between the current sentence and the auxiliary context. To mitigate this problem, we propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context. To verify the effectiveness of our method, extensive experiments and extra quantitative analysis are conducted on four document-level machine translation benchmarks. The experimental results demonstrate that our model significantly outperforms previous models on all datasets via the soft selection mechanism. △ Less

Submitted 17 January, 2023; originally announced January 2023.

arXiv:2301.02111 [pdf, other]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Authors: Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training… ▽ More We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work. △ Less

Submitted 5 January, 2023; originally announced January 2023.

Comments: Working in progress

arXiv:2301.02010 [pdf, other]

HIT-SCIR at MMNLU-22: Consistency Regularization for Multilingual Spoken Language Understanding

Authors: Bo Zheng, Zhouyang Li, Fuxuan Wei, Qiguang Chen, Libo Qin, Wanxiang Che

Abstract: Multilingual spoken language understanding (SLU) consists of two sub-tasks, namely intent detection and slot filling. To improve the performance of these two sub-tasks, we propose to use consistency regularization based on a hybrid data augmentation strategy. The consistency regularization enforces the predicted distributions for an example and its semantically equivalent augmentation to be consis… ▽ More Multilingual spoken language understanding (SLU) consists of two sub-tasks, namely intent detection and slot filling. To improve the performance of these two sub-tasks, we propose to use consistency regularization based on a hybrid data augmentation strategy. The consistency regularization enforces the predicted distributions for an example and its semantically equivalent augmentation to be consistent. We conduct experiments on the MASSIVE dataset under both full-dataset and zero-shot settings. Experimental results demonstrate that our proposed method improves the performance on both intent detection and slot filling tasks. Our system\footnote{The code will be available at \url{https://github.com/bozheng-hit/MMNLU-22-HIT-SCIR}.} ranked 1st in the MMNLU-22 competition under the full-dataset setting. △ Less

Submitted 5 January, 2023; originally announced January 2023.

Comments: Accepted by EMNLP2022 MMNLU-22 Workshop. The winner of the MMNLU-22 Competition Full Dataset Task. Code is available at https://github.com/bozheng-hit/MMNLU-22-HIT-SCIR

arXiv:2301.01296 [pdf, other]

TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models

Authors: Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu

Abstract: Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different option… ▽ More Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM. △ Less

Submitted 3 January, 2023; originally announced January 2023.

Comments: Code is available at https://github.com/OliverRensu/TinyMIM

arXiv:2212.10923 [pdf, other]

Language Models as Inductive Reasoners

Authors: Zonglin Yang, Li Dong, Xinya Du, Hao Cheng, Erik Cambria, Xiaodong Liu, Jianfeng Gao, Furu Wei

Abstract: Inductive reasoning is a core component of human intelligence. In the past research of inductive reasoning within computer science, formal language is used as representations of knowledge (facts and rules, more specifically). However, formal language can cause systematic problems for inductive reasoning such as disability of handling raw input such as natural language, sensitiveness to mislabeled… ▽ More Inductive reasoning is a core component of human intelligence. In the past research of inductive reasoning within computer science, formal language is used as representations of knowledge (facts and rules, more specifically). However, formal language can cause systematic problems for inductive reasoning such as disability of handling raw input such as natural language, sensitiveness to mislabeled data, and incapacity to handle ambiguous input. To this end, we propose a new paradigm (task) for inductive reasoning, which is to induce natural language rules from natural language facts, and create a dataset termed DEER containing 1.2k rule-fact pairs for the task, where rules and facts are written in natural language. New automatic metrics are also proposed and analysed for the evaluation of this task. With DEER, we investigate a modern approach for inductive reasoning where we use natural language as representation for knowledge instead of formal language and use pretrained language models as ''reasoners''. Moreover, we provide the first and comprehensive analysis of how well pretrained language models can induce natural language rules from natural language facts. We also propose a new framework drawing insights from philosophy literature for this task, which we show in the experiment section that surpasses baselines in both automatic and human evaluations. We discuss about our future perspectives for inductive reasoning in Section 7. Dataset and code are available at https://github.com/ZonglinY/Inductive_Reasoning. △ Less

Submitted 5 February, 2024; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: Accepted by EACL 2024

arXiv:2212.10559 [pdf, other]

Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Authors: Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, Furu Wei

Abstract: Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning… ▽ More Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at \url{https://aka.ms/icl}. △ Less

Submitted 15 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: Accepted to ACL 2023 findings

arXiv:2212.10554 [pdf, other]

A Length-Extrapolatable Transformer

Authors: Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei

Abstract: Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then we propose two designs to improve the above metric of Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention r… ▽ More Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then we propose two designs to improve the above metric of Transformers. Specifically, we introduce a relative position embedding to explicitly maximize attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants with language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer. △ Less

Submitted 20 December, 2022; originally announced December 2022.

Comments: 9 pages

arXiv:2212.10218 [pdf, other]

GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator

Authors: Jian Yang, Shuming Ma, Li Dong, Shaohan Huang, Haoyang Huang, Yuwei Yin, Dongdong Zhang, Liqun Yang, Furu Wei, Zhoujun Li

Abstract: Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language u… ▽ More Pre-trained models have achieved remarkable success in natural language processing (NLP). However, existing pre-training methods underutilize the benefits of language understanding for generation. Inspired by the idea of Generative Adversarial Networks (GANs), we propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator, unifying the ability of language understanding and generation in a single model. Our model, named as GanLM, is trained with two pre-training objectives: replaced token detection and replaced token denoising. Specifically, given masked source sentences, the generator outputs the target distribution and the discriminator predicts whether the target sampled tokens from distribution are incorrect. The target sentence is replaced with misclassified tokens to construct noisy previous context, which is used to generate the gold sentence. In general, both tasks improve the ability of language understanding and generation by selectively using the denoising data. Extensive experiments in language generation benchmarks show that GanLM with the powerful language understanding capability outperforms various strong pre-trained language models (PLMs) and achieves state-of-the-art performance. △ Less

Submitted 9 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: ACL 2023

arXiv:2212.10190 [pdf, other]

Pay Attention to Your Tone: Introducing a New Dataset for Polite Language Rewrite

Authors: Xun Wang, Tao Ge, Allen Mao, Yuki Li, Furu Wei, Si-Qing Chen

Abstract: We introduce \textsc{PoliteRewrite} -- a dataset for polite language rewrite which is a novel sentence rewrite task. Compared with previous text style transfer tasks that can be mostly addressed by slight token- or phrase-level edits, polite language rewrite requires deep understanding and extensive sentence-level edits over an offensive and impolite sentence to deliver the same message euphemisti… ▽ More We introduce \textsc{PoliteRewrite} -- a dataset for polite language rewrite which is a novel sentence rewrite task. Compared with previous text style transfer tasks that can be mostly addressed by slight token- or phrase-level edits, polite language rewrite requires deep understanding and extensive sentence-level edits over an offensive and impolite sentence to deliver the same message euphemistically and politely, which is more challenging -- not only for NLP models but also for human annotators to rewrite with effort. To alleviate the human effort for efficient annotation, we first propose a novel annotation paradigm by a collaboration of human annotators and GPT-3.5 to annotate \textsc{PoliteRewrite}. The released dataset has 10K polite sentence rewrites annotated collaboratively by GPT-3.5 and human, which can be used as gold standard for training, validation and test; and 100K high-quality polite sentence rewrites by GPT-3.5 without human review. We wish this work (The dataset (10K+100K) will be released soon) could contribute to the research on more challenging sentence rewrite, and provoke more thought in future on resource annotation paradigm with the help of the large-scaled pretrained models. △ Less

Submitted 20 December, 2022; originally announced December 2022.

arXiv:2212.09611 [pdf, other]

Optimizing Prompts for Text-to-Image Generation

Authors: Yaru Hao, Zewen Chi, Li Dong, Furu Wei

Abstract: Well-designed prompts can guide text-to-image models to generate amazing images. However, the performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretr… ▽ More Well-designed prompts can guide text-to-image models to generate amazing images. However, the performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretrained language model on a small collection of manually engineered prompts. Then we use reinforcement learning to explore better prompts. We define a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Moreover, reinforcement learning further boosts performance, especially on out-of-domain prompts. The pretrained checkpoints are available at https://aka.ms/promptist. The demo can be found at https://aka.ms/promptist-demo. △ Less

Submitted 29 December, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: Accepted by NeurIPS-23

arXiv:2212.09058 [pdf, other]

BEATs: Audio Pre-Training with Acoustic Tokenizers

Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Furu Wei

Abstract: The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, the state-of-the-art audio SSL models still employ reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL m… ▽ More The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, the state-of-the-art audio SSL models still employ reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract the high-level audio semantics and discard the redundant details as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, due to the continuous property of audio and unavailable phoneme sequences like speech. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, where an acoustic tokenizer and an audio SSL model are optimized by iterations. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated with the hope of mutual promotion of the acoustic tokenizer and audio SSL model. The experimental results demonstrate our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even outperforming previous models that use more training data and model parameters significantly. Specifically, we set a new state-of-the-art mAP 50.6% on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats. △ Less

Submitted 18 December, 2022; originally announced December 2022.

arXiv:2212.08653 [pdf, other]

Attentive Mask CLIP

Authors: Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang, Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, Yuqing Yang

Abstract: Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, this efficient augmentation strategy has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incor… ▽ More Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, this efficient augmentation strategy has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to apply multiple augmentation views to the image, as well as introducing instance contrastive learning tasks between these views into the CLIP framework. Compared to other CLIP improvements that combine different pre-training targets such as SLIP and MaskCLIP, our method is not only more effective, but also much more efficient. Specifically, using ViT-B and YFCC-15M dataset, our approach achieves $43.9\%$ top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$ and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are $+1.1\%$, $+5.5/+0.9$, and $+4.4/+1.3$ higher than the SLIP method, while being $2.30\times$ faster. An efficient version of our approach running $1.16\times$ faster than the plain CLIP model achieves significant gains of $+5.3\%$, $+11.3/+8.0$, and $+9.5/+4.9$ on these benchmarks. △ Less

Submitted 9 October, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2771-2781

arXiv:2212.07752 [pdf, other]

Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models

Authors: Hongyuan Lu, Haoyang Huang, Shuming Ma, Dongdong Zhang, Wai Lam, Furu Wei

Abstract: Despite the success of multilingual sequence-to-sequence pre-training, most existing approaches rely on document-level monolingual corpora in many different languages, sentence-level bilingual corpora,\footnote{In this paper, we use `bilingual corpora' to denote parallel corpora with `bilingual translation pairs' in many different language pairs, each consisting of two sentences/documents with the… ▽ More Despite the success of multilingual sequence-to-sequence pre-training, most existing approaches rely on document-level monolingual corpora in many different languages, sentence-level bilingual corpora,\footnote{In this paper, we use `bilingual corpora' to denote parallel corpora with `bilingual translation pairs' in many different language pairs, each consisting of two sentences/documents with the same meaning written in different languages. We use `trilingual corpora' to denote parallel corpora with `trilingual translation pairs' in many different language combinations, each consisting of three sentences/documents.} and sometimes synthetic document-level bilingual corpora. This hampers the performance with cross-lingual document-level tasks such as document-level translation. Therefore, we propose to mine and leverage document-level trilingual parallel corpora to improve sequence-to-sequence multilingual pre-training. We present \textbf{Tri}angular Document-level \textbf{P}re-training (\textbf{TRIP}), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting. Experiments show that TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points. △ Less

Submitted 13 May, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

arXiv:2212.06713 [pdf, other]

Structured Prompting: Scaling In-Context Learning to 1,000 Examples

Authors: Yaru Hao, Yutao Sun, Li Dong, Zhixiong Han, Yuxian Gu, Furu Wei

Abstract: Large language models have exhibited intriguing in-context learning capability, achieving promising zero- and few-shot performance without updating the parameters. However, conventional in-context learning is usually restricted by length constraints, rendering it ineffective to absorb supervision from a large number of examples. In order to go beyond few shots, we introduce structured prompting th… ▽ More Large language models have exhibited intriguing in-context learning capability, achieving promising zero- and few-shot performance without updating the parameters. However, conventional in-context learning is usually restricted by length constraints, rendering it ineffective to absorb supervision from a large number of examples. In order to go beyond few shots, we introduce structured prompting that breaks the length limit and scales in-context learning to thousands of examples. Specifically, demonstration examples are separately encoded with well-designed position embeddings, and then they are jointly attended by the test example using a rescaled attention mechanism. So we can scale the number of exemplars with linear complexity instead of quadratic complexity with respect to length. Experimental results on a diverse set of tasks show that our approach improves end-task performance and reduces evaluation variance over conventional in-context learning as the number of demonstration examples increases. Code has been released at https://aka.ms/structured-prompting. △ Less

Submitted 13 December, 2022; originally announced December 2022.

Comments: 14 pages

arXiv:2212.04257 [pdf, other]

Momentum Calibration for Text Generation

Authors: Xingxing Zhang, Yiran Liu, Xun Wang, Pengcheng He, Yang Yu, Si-Qing Chen, Wayne Xiong, Furu Wei

Abstract: The input and output of most text generation tasks can be transformed to two sequences of tokens and they can be modeled using sequence-to-sequence learning modeling tools such as Transformers. These models are usually trained by maximizing the likelihood the output text sequence and assumes the input sequence and all gold preceding tokens are given during training, while during inference the mode… ▽ More The input and output of most text generation tasks can be transformed to two sequences of tokens and they can be modeled using sequence-to-sequence learning modeling tools such as Transformers. These models are usually trained by maximizing the likelihood the output text sequence and assumes the input sequence and all gold preceding tokens are given during training, while during inference the model suffers from the exposure bias problem (i.e., it only has access to its previously predicted tokens rather gold tokens during beam search). In this paper, we propose MoCa ({\bf Mo}mentum {\bf Ca}libration) for text generation. MoCa is an online method that dynamically generates slowly evolving (but consistent) samples using a momentum moving average generator with beam search and MoCa learns to align its model scores of these samples with their actual qualities. Experiments on four text generation datasets (i.e., CNN/DailyMail, XSum, SAMSum and Gigaword) show MoCa consistently improves strong pre-trained transformers using vanilla fine-tuning and we achieve the state-of-the-art results on CNN/DailyMail and SAMSum datasets. △ Less

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2212.03533 [pdf, other]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Authors: Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei

Abstract: This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clu… ▽ More This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters. △ Less

Submitted 22 February, 2024; v1 submitted 7 December, 2022; originally announced December 2022.

Comments: 17 pages, v2 fixes the SummEval numbers

arXiv:2212.00616 [pdf, other]

Extensible Prompts for Language Models on Zero-shot Language Style Customization

Authors: Tao Ge, Jing Hu, Li Dong, Shaoguang Mao, Yan Xia, Xun Wang, Si-Qing Chen, Furu Wei

Abstract: We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imagi… ▽ More We propose eXtensible Prompt (X-Prompt) for prompting a large language model (LLM) beyond natural language (NL). X-Prompt instructs an LLM with not only NL but also an extensible vocabulary of imaginary words. Registering new imaginary words allows us to instruct the LLM to comprehend concepts that are difficult to describe with NL words, thereby making a prompt more descriptive. Also, these imaginary words are designed to be out-of-distribution (OOD) robust so that they can be (re)used like NL words in various prompts, distinguishing X-Prompt from soft prompt that is for fitting in-distribution data. We propose context-augmented learning (CAL) to learn imaginary words for general usability, enabling them to work properly in OOD (unseen) prompts. We experiment X-Prompt for zero-shot language style customization as a case study. The promising results of X-Prompt demonstrate its potential to facilitate advanced interaction beyond the natural language interface, bridging the communication gap between humans and LLMs. △ Less

Submitted 30 November, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: Accepted by NeurIPS 2023

arXiv:2211.13184 [pdf, other]

TorchScale: Transformers at Scale

Authors: Shuming Ma, Hongyu Wang, Shaohan Huang, Wenhui Wang, Zewen Chi, Li Dong, Alon Benhaim, Barun Patra, Vishrav Chaudhary, Xia Song, Furu Wei

Abstract: Large Transformers have achieved state-of-the-art performance across many tasks. Most open-source libraries on scaling Transformers focus on improving training or inference with better parallelization. In this work, we present TorchScale, an open-source toolkit that allows researchers and developers to scale up Transformers efficiently and effectively. TorchScale has the implementation of several… ▽ More Large Transformers have achieved state-of-the-art performance across many tasks. Most open-source libraries on scaling Transformers focus on improving training or inference with better parallelization. In this work, we present TorchScale, an open-source toolkit that allows researchers and developers to scale up Transformers efficiently and effectively. TorchScale has the implementation of several modeling techniques, which can improve modeling generality and capability, as well as training stability and efficiency. Experimental results on language modeling and neural machine translation demonstrate that TorchScale can successfully scale Transformers to different sizes without tears. The library is available at https://aka.ms/torchscale. △ Less

Submitted 23 November, 2022; originally announced November 2022.

Comments: Work in progress

arXiv:2211.11275 [pdf, other]

doi 10.1109/TMM.2023.3275873

VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Authors: Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei

Abstract: Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech rep… ▽ More Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning was not well explored. In this paper, we propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR), visual speech recognition (VSR) tasks. Results show that the proposed VATLM outperforms previous the state-of-the-art models, such as audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm. △ Less

Submitted 19 May, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: 11 pages, Accepted by IEEE Transactions on Multimedia

arXiv:2211.06722 [pdf, ps, other]

An Asymptotically Sharp Bound on the Maximum Number of Independent Transversals

Authors: Jake Ruotolo, Kevin Wang, Fan Wei

Abstract: Let $G$ be a multipartite graph with partition $V_1, V_2,\ldots, V_k$ of $V(G)$. Let $d_{i,j}$ denote the edge density of the pair $(V_i, V_j)$. An independent transversal is an independent set of $G$ with exactly one vertex in each $V_i$. In this paper, we prove an asymptotically sharp upper bound on the maximum number of independent transversals given the $d_{i,j}$'s. Let $G$ be a multipartite graph with partition $V_1, V_2,\ldots, V_k$ of $V(G)$. Let $d_{i,j}$ denote the edge density of the pair $(V_i, V_j)$. An independent transversal is an independent set of $G$ with exactly one vertex in each $V_i$. In this paper, we prove an asymptotically sharp upper bound on the maximum number of independent transversals given the $d_{i,j}$'s. △ Less

Submitted 31 January, 2024; v1 submitted 12 November, 2022; originally announced November 2022.

Comments: 15 pages, 1 figure

MSC Class: 05C35; 05C69

arXiv:2211.01837 [pdf, other]

Latent Prompt Tuning for Text Summarization

Authors: Yubo Zhang, Xingxing Zhang, Xun Wang, Si-qing Chen, Furu Wei

Abstract: Prompts with different control signals (e.g., length, keywords, etc.) can be used to control text summarization. When control signals are available, they can control the properties of generated summaries and potentially improve summarization quality (since more information are given). Unfortunately, control signals are not already available during inference time. In this paper, we propose Lotus (s… ▽ More Prompts with different control signals (e.g., length, keywords, etc.) can be used to control text summarization. When control signals are available, they can control the properties of generated summaries and potentially improve summarization quality (since more information are given). Unfortunately, control signals are not already available during inference time. In this paper, we propose Lotus (shorthand for Latent Prompt Tuning for Summarization), which is a single model that can be applied in both controlled and uncontrolled (without control signals) modes. During training, Lotus learns latent prompt representations from prompts with gold control signals using a contrastive learning objective. Experiments show Lotus in uncontrolled mode consistently improves upon strong (uncontrollable) summarization models across four different summarization datasets. We also demonstrate generated summaries can be controlled using prompts with user specified control tokens. △ Less

Submitted 19 December, 2022; v1 submitted 3 November, 2022; originally announced November 2022.

arXiv:2211.01367 [pdf, other]

Two-Stream Network for Sign Language Recognition and Translation

Authors: Yutong Chen, Ronglai Zuo, Fangyun Wei, Yu Wu, Shujie Liu, Brian Mak

Abstract: Sign languages are visual languages using manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook the key information for sign language understa… ▽ More Sign languages are visual languages using manual articulations and non-manual elements to convey information. For sign language recognition and translation, the majority of existing approaches directly encode RGB videos into hidden representations. RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook the key information for sign language understanding. To mitigate this problem and better incorporate domain knowledge, such as handshape and body movement, we introduce a dual visual encoder containing two separate streams to model both the raw videos and the keypoint sequences generated by an off-the-shelf keypoint estimator. To make the two streams interact with each other, we explore a variety of techniques, including bidirectional lateral connection, sign pyramid network with auxiliary supervision, and frame-level self-distillation. The resulting model is called TwoStream-SLR, which is competent for sign language recognition (SLR). TwoStream-SLR is extended to a sign language translation (SLT) model, TwoStream-SLT, by simply attaching an extra translation network. Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art performance on SLR and SLT tasks across a series of datasets including Phoenix-2014, Phoenix-2014T, and CSL-Daily. Code and models are available at: https://github.com/FangyunWei/SLRT. △ Less

Submitted 22 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

Comments: Accepted by NeurIPS 2022. Code and models are available at: https://github.com/FangyunWei/SLRT

arXiv:2210.17085 [pdf, ps, other]

doi 10.1016/j.chaos.2023.113224

Dynamics of an HIV/AIDS transmission model with protection awareness and fluctuations

Authors: Xuanpei Zhai, Wenshuang Li, Fengying Wei, Xuerong Mao

Abstract: We establish a stochastic HIV/AIDS model for the individuals with protection awareness and reveal how the protection awareness plays its important role in the control of AIDS. We firstly show that there exists a global positive solution for the stochastic model. By constructing Lyapunov functions, the ergodic stationary distribution when $R_{0}^{s}>1$ and the extinction when $R_{0}^{e}<1$ for the… ▽ More We establish a stochastic HIV/AIDS model for the individuals with protection awareness and reveal how the protection awareness plays its important role in the control of AIDS. We firstly show that there exists a global positive solution for the stochastic model. By constructing Lyapunov functions, the ergodic stationary distribution when $R_{0}^{s}>1$ and the extinction when $R_{0}^{e}<1$ for the stochastic model are obtained. A number of numerical simulations by using positive preserving truncated Euler-Maruyama method (PPTEM) are performed to illustrate the theoretical results. Our new results show that the detailed publicity has great impact on the control of AIDS compared with the extensive publicity, while the continuous antiretroviral therapy (ART) is helpful in the control of HIV/AIDS. △ Less

Submitted 1 November, 2022; v1 submitted 31 October, 2022; originally announced October 2022.

arXiv:2210.17027 [pdf, other]

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Authors: Kun Wei, Long Zhou, Ziqiang Zhang, Liping Chen, Shujie Liu, Lei He, Jinyu Li, Furu Wei

Abstract: Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speec… ▽ More Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from the data scarcity problem because the corpora from speech of the source language to speech of the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S gets an improvement of about 5 BLEU scores compared to encoder-only pre-training models, and achieves a competitive or even better performance than existing state-of-the-art models1. △ Less

Submitted 30 October, 2022; originally announced October 2022.

Comments: Submitted to ICASSP 2023

arXiv:2210.15461 [pdf, other]

LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation

Authors: Hongcheng Guo, Jiaheng Liu, Haoyang Huang, Jian Yang, Zhoujun Li, Dongdong Zhang, Zheng Cui, Furu Wei

Abstract: Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features, which has attracted considerable attention from both natural language processing and computer vision communities. Recent advances still struggle to train a separate model for each language pair, which is costly and unaffordable when the number of languages increases in the real world. In other wor… ▽ More Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features, which has attracted considerable attention from both natural language processing and computer vision communities. Recent advances still struggle to train a separate model for each language pair, which is costly and unaffordable when the number of languages increases in the real world. In other words, the multilingual multimodal machine translation (Multilingual MMT) task has not been investigated, which aims to handle the aforementioned issues by providing a shared semantic space for multiple languages. Besides, the image modality has no language boundaries, which is superior to bridging the semantic gap between languages. To this end, we first propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages. Then, an effective baseline LVP-M3 using visual prompts is proposed to support translations between different languages, which includes three stages (token encoding, language-aware visual prompt generation, and language translation). Extensive experimental results on our constructed benchmark datasets demonstrate the effectiveness of LVP-M3 method for Multilingual MMT. △ Less

Submitted 28 November, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted by EMNLP 2022

arXiv:2210.14867 [pdf, other]

Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning

Authors: Barun Patra, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, Xia Song

Abstract: In this paper, we elaborate upon recipes for building multilingual representation models that are not only competitive with existing state-of-the-art models but are also more parameter efficient, thereby promoting better adoption in resource-constrained scenarios and practical applications. We show that going beyond English-centric bitexts, coupled with a novel sampling strategy aimed at reducing… ▽ More In this paper, we elaborate upon recipes for building multilingual representation models that are not only competitive with existing state-of-the-art models but are also more parameter efficient, thereby promoting better adoption in resource-constrained scenarios and practical applications. We show that going beyond English-centric bitexts, coupled with a novel sampling strategy aimed at reducing under-utilization of training data, substantially boosts performance across model sizes for both Electra and MLM pre-training objectives. We introduce XY-LENT: X-Y bitext enhanced Language ENcodings using Transformers which not only achieves state-of-the-art performance over 5 cross-lingual tasks within all model size bands, is also competitive across bands. Our XY-LENT XL variant outperforms XLM-RXXL and exhibits competitive performance with mT5 XXL while being 5x and 6x smaller respectively. We then show that our proposed method helps ameliorate the curse of multilinguality, with the XY-LENT XL achieving 99.3% GLUE performance and 98.5% SQuAD 2.0 performance compared to a SoTA English only model in the same size band. We then analyze our models performance on extremely low resource languages and posit that scaling alone may not be sufficient for improving the performance in this scenario △ Less

Submitted 26 October, 2022; originally announced October 2022.

Comments: Work in progress

arXiv:2210.10615 [pdf, other]

A Unified View of Masked Image Modeling

Authors: Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei

Abstract: Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed as MaskDistill, which rec… ▽ More Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves comparable or superior performance than state-of-the-art methods. When using the huge vision Transformer and pretraining 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8% semantic segmentation mIoU metric on ADE20k (512 size). The code and pretrained models will be available at https://aka.ms/unimim. △ Less

Submitted 19 October, 2022; originally announced October 2022.

arXiv:2210.09304 [pdf, other]

Non-Contrastive Learning Meets Language-Image Pre-Training

Authors: Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei

Abstract: Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and craving for a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP), and study whether nice pro… ▽ More Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and craving for a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP), and study whether nice properties exhibited in visual self-supervised models can emerge. We empirically observe that the non-contrastive objective nourishes representation learning while sufficiently underperforming under zero-shot recognition. Based on the above study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. Systematic evaluation is conducted spanning a wide variety of downstream tasks including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, showcasing a consistent performance gain and validating the effectiveness of xCLIP. △ Less

Submitted 17 October, 2022; originally announced October 2022.

arXiv:2210.07022 [pdf, other]

CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation

Authors: Jian Yang, Shaohan Huang, Shuming Ma, Yuwei Yin, Li Dong, Dongdong Zhang, Hongcheng Guo, Zhoujun Li, Furu Wei

Abstract: Named entity recognition (NER) suffers from the scarcity of annotated training data, especially for low-resource languages without labeled data. Cross-lingual NER has been proposed to alleviate this issue by transferring knowledge from high-resource languages to low-resource languages via aligned cross-lingual representations or machine translation results. However, the performance of cross-lingua… ▽ More Named entity recognition (NER) suffers from the scarcity of annotated training data, especially for low-resource languages without labeled data. Cross-lingual NER has been proposed to alleviate this issue by transferring knowledge from high-resource languages to low-resource languages via aligned cross-lingual representations or machine translation results. However, the performance of cross-lingual NER methods is severely affected by the unsatisfactory quality of translation or label projection. To address these problems, we propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER with the help of a multilingual labeled sequence translation model. Specifically, the target sequence is first translated into the source language and then tagged by a source NER model. We further adopt a labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence. Ultimately, the whole pipeline is integrated into an end-to-end model by the way of self-training. Experimental results on two benchmarks demonstrate that our method substantially outperforms the previous strong baseline by a large margin of +3~7 F1 scores and achieves state-of-the-art performance. △ Less

Submitted 13 October, 2022; originally announced October 2022.

Comments: 10 pages

arXiv:2210.06465 [pdf, other]

AniFaceGAN: Animatable 3D-Aware Face Image Generation for Video Avatars

Authors: Yue Wu, Yu Deng, Jiaolong Yang, Fangyun Wei, Qifeng Chen, Xin Tong

Abstract: Although 2D generative models have made great progress in face image generation and animation, they often suffer from undesirable artifacts such as 3D inconsistency when rendering images from different camera viewpoints. This prevents them from synthesizing video animations indistinguishable from real ones. Recently, 3D-aware GANs extend 2D GANs for explicit disentanglement of camera pose by lever… ▽ More Although 2D generative models have made great progress in face image generation and animation, they often suffer from undesirable artifacts such as 3D inconsistency when rendering images from different camera viewpoints. This prevents them from synthesizing video animations indistinguishable from real ones. Recently, 3D-aware GANs extend 2D GANs for explicit disentanglement of camera pose by leveraging 3D scene representations. These methods can well preserve the 3D consistency of the generated images across different views, yet they cannot achieve fine-grained control over other attributes, among which facial expression control is arguably the most useful and desirable for face animation. In this paper, we propose an animatable 3D-aware GAN for multiview consistent face animation generation. The key idea is to decompose the 3D representation of the 3D-aware GAN into a template field and a deformation field, where the former represents different identities with a canonical expression, and the latter characterizes expression variations of each identity. To achieve meaningful control over facial expressions via deformation, we propose a 3D-level imitative learning scheme between the generator and a parametric 3D face model during adversarial training of the 3D-aware GAN. This helps our method achieve high-quality animatable face image generation with strong visual 3D consistency, even though trained with only unstructured 2D images. Extensive experiments demonstrate our superior performance over prior works. Project page: https://yuewuhkust.github.io/AniFaceGAN △ Less

Submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted by NeurIPS 2022. Project Page: https://yuewuhkust.github.io/AniFaceGAN

arXiv:2210.06423 [pdf, other]

Foundation Transformers

Authors: Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, Furu Wei

Abstract: A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves… ▽ More A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3). △ Less

Submitted 19 October, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Work in progress

arXiv:2210.03730 [pdf, other]

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Authors: Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, Furu Wei

Abstract: The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden-unit as an interface to align speech and text, we can decomp… ▽ More The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden-unit as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data respectively. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT gets substantial improvements over strong baselines, and achieves state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand the proposed SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at https://aka.ms/SpeechUT. △ Less

Submitted 7 October, 2022; originally announced October 2022.

Comments: 14 pages, accepted by EMNLP 2022

arXiv:2210.02849 [pdf, other]

XDoc: Unified Pre-training for Cross-Format Document Understanding

Authors: Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei

Abstract: The surge of pre-training has witnessed the rapid development of document understanding recently. Pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at one time, making it difficult t… ▽ More The surge of pre-training has witnessed the rapid development of document understanding recently. Pre-training and fine-tuning framework has been effectively used to tackle texts in various formats, including plain texts, document texts, and web texts. Despite achieving promising performance, existing pre-trained models usually target one specific document format at one time, making it difficult to combine knowledge from multiple document formats. To address this, we propose XDoc, a unified pre-trained model which deals with different document formats in a single model. For parameter efficiency, we share backbone parameters for different formats such as the word embedding layer and the Transformer layers. Meanwhile, we introduce adaptive layers with lightweight parameters to enhance the distinction across different formats. Experimental results have demonstrated that with only 36.7% parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost effective for real-world deployment. The code and pre-trained models will be publicly available at \url{https://aka.ms/xdoc}. △ Less

Submitted 6 October, 2022; originally announced October 2022.

Comments: EMNLP 2022

arXiv:2209.15329 [pdf, other]

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Authors: Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei

Abstract: How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discret… ▽ More How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM. △ Less

Submitted 15 June, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: We have corrected the errors in the pre-training data for SpeechLM-P Base models, new results are updated

arXiv:2209.13940 [pdf, other]

Revamping Multilingual Agreement Bidirectionally via Switched Back-translation for Multilingual Neural Machine Translation

Authors: Hongyuan Lu, Haoyang Huang, Shuming Ma, Dongdong Zhang, Furu Wei, Wai Lam

Abstract: Despite the fact that multilingual agreement (MA) has shown its importance for multilingual neural machine translation (MNMT), current methodologies in the field have two shortages: (i) require parallel data between multiple language pairs, which is not always realistic and (ii) optimize the agreement in an ambiguous direction, which hampers the translation performance. We present \textbf{B}idirec… ▽ More Despite the fact that multilingual agreement (MA) has shown its importance for multilingual neural machine translation (MNMT), current methodologies in the field have two shortages: (i) require parallel data between multiple language pairs, which is not always realistic and (ii) optimize the agreement in an ambiguous direction, which hampers the translation performance. We present \textbf{B}idirectional \textbf{M}ultilingual \textbf{A}greement via \textbf{S}witched \textbf{B}ack-\textbf{t}ranslation (\textbf{BMA-SBT}), a novel and universal multilingual agreement framework for fine-tuning pre-trained MNMT models, which (i) exempts the need for aforementioned parallel data by using a novel method called switched BT that creates synthetic text written in another source language using the translation target and (ii) optimizes the agreement bidirectionally with the Kullback-Leibler Divergence loss. Experiments indicate that BMA-SBT clearly improves the strong baselines on the task of MNMT with three benchmarks: TED Talks, News, and Europarl. In-depth analyzes indicate that BMA-SBT brings additive improvements to the conventional BT method. △ Less

Submitted 15 May, 2023; v1 submitted 28 September, 2022; originally announced September 2022.

arXiv:2209.07703 [pdf]

Analyses of Flight Time During Solar Proton Events and Solar Flares

Authors: X. H. Xu, Y. Wang, F. S. Wei, X. S. Feng, M. H. Bo, H. W. Tang, D. S. Wang, B. Lei, B. Y. Wang, P. B. Zuo, C. W. Jiang, X. J. Xu, Z. L. Zhou, Z. Li, P. Zou, L. D. Wang, Y. X. Gu, Y. L. Chen, W. Y. Zhang, P. Sun

Abstract: Analyzing the effects of space weather on aviation is a new and developing topic. It has been commonly accepted that the flight time of the polar flights may increase during solar proton events because the flights have to change their route to avoid the high-energy particles. However, apart from such phenomenon, researches related to the flight time during space weather events is very rare. Based… ▽ More Analyzing the effects of space weather on aviation is a new and developing topic. It has been commonly accepted that the flight time of the polar flights may increase during solar proton events because the flights have to change their route to avoid the high-energy particles. However, apart from such phenomenon, researches related to the flight time during space weather events is very rare. Based on the analyses of 39 representative international air routes around westerlies, it is found that 97.44% (94.87%) of the commercial airplanes on the westbound (eastbound) air routes reveal shorter (longer) flight time during solar proton events compared to those during quiet periods, and the averaged magnitude of change in flight time is ~10 min or 0.21%-4.17% of the total flight durations. Comparative investigations reassure the certainty of such phenomenon that the directional differences in flight time are still incontrovertible regardless of over-land routes (China-Europe) or over-sea routes (China-Western America). Further analyses suggest that the solar proton events associated atmospheric heating will change the flight durations by weakening certain atmospheric circulations, such as the polar jet stream. While the polar jet stream will not be obviously altered during solar flares so that the directional differences in flight time are not found. Besides the conventional space weather effects already known, this paper is the first report that indicates a distinct new scenario of how the solar proton events affect flight time. These analyses are also important for aviation since our discoveries could help the airways optimize the air routes to save passenger time costs, reduce fuel costs and even contribute to the global warming issues. △ Less

Submitted 15 September, 2022; originally announced September 2022.

Comments: submitted to Scientific Reports

arXiv:2209.07701 [pdf]

Characteristics of Flight Delays during Solar Flares

Authors: X. H. Xu, Y. Wang, F. S. Wei, X. S. Feng, M. H. Bo, H. W. Tang, D. S. Wang, L. Bian, B. Y. Wang, W. Y. Zhang, Y. S. Huang, Z. Li, J. P. Guo, P. B. Zuo, C. W. Jiang, X. J. Xu, Z. L. Zhou, P. Zou

Abstract: Solar flare is one of the severest solar activities on the sun, and it has many important impacts on the near-earth space. It has been found that flight arrival delays will increase during solar flare. However, the detailed intrinsic mechanism of how solar flares influence the delays is still unknown. Based on 5-years huge amount of flight data, here we comprehensively analyze the flight departure… ▽ More Solar flare is one of the severest solar activities on the sun, and it has many important impacts on the near-earth space. It has been found that flight arrival delays will increase during solar flare. However, the detailed intrinsic mechanism of how solar flares influence the delays is still unknown. Based on 5-years huge amount of flight data, here we comprehensively analyze the flight departure delays during 57 solar flares. It is found that the averaged flight departure delay time during solar flares increased by 20.68% (7.67 min) compared to those during quiet periods. It is also shown that solar flare related flight delays reveal apparent time and latitude dependencies. Flight delays during dayside solar flares are more serious than those during nightside flares, and the longer (shorter) delays tend to occur in the lower (higher) latitude airport. Further analyses suggest that flight delay time and delay rate would be directly modulated by the solar intensity (soft X-ray flux) and the Solar Zenith Angle. For the first time, these results indicate that the communication interferences caused by solar flares will directly affect flight departure delay time and delay rate. This work also expands our conventional understandings to the impacts of solar flares on human society, and it could also provide us with brand new views to help prevent or cope with flight delays. △ Less

Submitted 15 September, 2022; originally announced September 2022.

Comments: submitted to APJL

arXiv:2209.07700 [pdf]

The Effects of Space Weather on Flight Delays

Authors: Y. Wang, X. H. Xu, F. S. Wei, X. S. Feng, M. H. Bo, H. W. Tang, D. S. Wang, L. Bian, B. Y. Wang, W. Y. Zhang, Y. S. Huang, Z. Li, J. P. Guo, P. B. Zuo, C. W. Jiang, X. J. Xu, Z. L. Zhou, P. Zou

Abstract: Although the sun is really far away from us, some solar activities could still influence the performance and reliability of space-borne and ground-based technological systems on Earth. Those time-varying conditions in space caused by the sun are also called space weather, as the atmospheric conditions that can affect weather on the ground. It is known that aviation activities can be affected durin… ▽ More Although the sun is really far away from us, some solar activities could still influence the performance and reliability of space-borne and ground-based technological systems on Earth. Those time-varying conditions in space caused by the sun are also called space weather, as the atmospheric conditions that can affect weather on the ground. It is known that aviation activities can be affected during space weather events, but the exact effects of space weather on aviation are still unclear. Especially how the flight delays, the top topic concerned by most people, will be affected by space weather has never been thoroughly researched. By analyzing huge amount of flight data (~5X106 records), for the first time, we demonstrate that space weather events could have systematically modulating effects on flight delays. The average arrival delay time and 30-minute delay rate during space weather events are significantly increased by 81.34% and 21.45% respectively compared to those during quiet periods. The evident negative correlation between the yearly flight regularity rate and the yearly mean total sunspot number during 22 years also confirms such delay effects. Further studies indicate that the interference in communication and navigation caused by geomagnetic field fluctuations and ionospheric disturbances associated with the space weather events will increase the flight delay time and delay rate. These results expand the traditional field of space weather research and could also provide us with brand new views for improving the flight delay predications. △ Less

Submitted 15 September, 2022; originally announced September 2022.

Comments: submitted to science advances

arXiv:2209.04260 [pdf, other]

doi 10.1103/PhysRevD.106.063026

Search for relativistic fractionally charged particles in space

Authors: DAMPE Collaboration, F. Alemanno, C. Altomare, Q. An, P. Azzarello, F. C. T. Barbato, P. Bernardini, X. J. Bi, M. S. Cai, E. Casilli, E. Catanzani, J. Chang, D. Y. Chen, J. L. Chen, Z. F. Chen, M. Y. Cui, T. S. Cui, Y. X. Cui, H. T. Dai, A. De-Benedittis, I. De Mitri, F. de Palma, M. Deliyergiyev, A. Di Giovanni, M. Di Santo , et al. (126 additional authors not shown)

Abstract: More than a century after the performance of the oil drop experiment, the possible existence of fractionally charged particles FCP still remains unsettled. The search for FCPs is crucial for some extensions of the Standard Model in particle physics. Most of the previously conducted searches for FCPs in cosmic rays were based on experiments underground or at high altitudes. However, there have been… ▽ More More than a century after the performance of the oil drop experiment, the possible existence of fractionally charged particles FCP still remains unsettled. The search for FCPs is crucial for some extensions of the Standard Model in particle physics. Most of the previously conducted searches for FCPs in cosmic rays were based on experiments underground or at high altitudes. However, there have been few searches for FCPs in cosmic rays carried out in orbit other than AMS-01 flown by a space shuttle and BESS by a balloon at the top of the atmosphere. In this study, we conduct an FCP search in space based on on-orbit data obtained using the DArk Matter Particle Explorer (DAMPE) satellite over a period of five years. Unlike underground experiments, which require an FCP energy of the order of hundreds of GeV, our FCP search starts at only a few GeV. An upper limit of $6.2\times 10^{-10}~~\mathrm{cm^{-2}sr^{-1} s^{-1}}$ is obtained for the flux. Our results demonstrate that DAMPE exhibits higher sensitivity than experiments of similar types by three orders of magnitude that more stringently restricts the conditions for the existence of FCP in primary cosmic rays. △ Less

Submitted 9 September, 2022; originally announced September 2022.

Comments: 19 pages, 6 figures, accepted by PRD

Report number: 106, 063026

Journal ref: Physical Review D 106.6 (2022): 063026

arXiv:2208.10442 [pdf, other]

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Authors: Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei

Abstract: A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Mult… ▽ More A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO). △ Less

Submitted 30 August, 2022; v1 submitted 22 August, 2022; originally announced August 2022.

Comments: 18 pages

arXiv:2208.09595 [pdf, ps, other]

The Saddle-Point Accountant for Differential Privacy

Authors: Wael Alghamdi, Shahab Asoodeh, Flavio P. Calmon, Juan Felipe Gomez, Oliver Kosut, Lalitha Sankar, Fei Wei

Abstract: We introduce a new differential privacy (DP) accountant called the saddle-point accountant (SPA). SPA approximates privacy guarantees for the composition of DP mechanisms in an accurate and fast manner. Our approach is inspired by the saddle-point method -- a ubiquitous numerical technique in statistics. We prove rigorous performance guarantees by deriving upper and lower bounds for the approximat… ▽ More We introduce a new differential privacy (DP) accountant called the saddle-point accountant (SPA). SPA approximates privacy guarantees for the composition of DP mechanisms in an accurate and fast manner. Our approach is inspired by the saddle-point method -- a ubiquitous numerical technique in statistics. We prove rigorous performance guarantees by deriving upper and lower bounds for the approximation error offered by SPA. The crux of SPA is a combination of large-deviation methods with central limit theorems, which we derive via exponentially tilting the privacy loss random variables corresponding to the DP mechanisms. One key advantage of SPA is that it runs in constant time for the $n$-fold composition of a privacy mechanism. Numerical experiments demonstrate that SPA achieves comparable accuracy to state-of-the-art accounting methods with a faster runtime. △ Less

Submitted 19 August, 2022; originally announced August 2022.

Comments: 31 pages, 4 figures

Showing 151–200 of 713 results for author: Wei, F