-
CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation
Authors:
Md Akmal Haidar,
Mehdi Rezagholizadeh,
Abbas Ghaddar,
Khalil Bibi,
Philippe Langlais,
Pascal Poupart
Abstract:
Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models. Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training. In this work, we propose a learning based data augmentation technique tailored for knowledge distillation, call…
▽ More
Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models. Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training. In this work, we propose a learning based data augmentation technique tailored for knowledge distillation, called CILDA. To the best of our knowledge, this is the first time that intermediate layer representations of the main task are used in improving the quality of augmented samples. More precisely, we introduce an augmentation technique for KD based on intermediate layer matching using contrastive loss to improve masked adversarial data augmentation. CILDA outperforms existing state-of-the-art KD approaches on the GLUE benchmark, as well as in an out-of-domain evaluation.
△ Less
Submitted 15 April, 2022;
originally announced April 2022.
-
RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation
Authors:
Md Akmal Haidar,
Nithin Anchuri,
Mehdi Rezagholizadeh,
Abbas Ghaddar,
Philippe Langlais,
Pascal Poupart
Abstract:
Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models) especially over large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and engineering efforts required for setting up a proper layer mapping. To address these problems, we propose a RAnd…
▽ More
Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the output of teacher and student models) especially over large pre-trained language models. However, intermediate layer distillation suffers from excessive computational burdens and engineering efforts required for setting up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which, intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection enforce that: all teacher layers are taken into account in the training process, while reducing the computational cost of intermediate layer distillation. Also, we show that it act as a regularizer for improving the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets. We show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training-time.
△ Less
Submitted 1 October, 2021; v1 submitted 21 September, 2021;
originally announced September 2021.
-
Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks
Authors:
Md Akmal Haidar,
Mehdi Rezagholizadeh
Abstract:
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored for low-resource ASR corpora. GANs help to learn the true data representation through a two-player min-max game. However, training an E2E ASR model using a large ASR corpus with a GAN framework has never been explored, because it might take excessively long time due to high-v…
▽ More
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored for low-resource ASR corpora. GANs help to learn the true data representation through a two-player min-max game. However, training an E2E ASR model using a large ASR corpus with a GAN framework has never been explored, because it might take excessively long time due to high-variance gradient updates and face convergence issues. In this paper, we introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective where the ASR model acts as a generator and a discriminator tries to distinguish the ASR output from the real data. Since the ASR model is pre-trained, we hypothesize that the ASR model output (soft distribution vectors) helps to get higher scores from the discriminator and makes the task of the discriminator harder within our GAN framework, which in turn improves the performance of the ASR model in the fine-tuning stage. Here, the pre-trained ASR model is fine-tuned adversarially against the discriminator using an additional adversarial loss. Experiments on full LibriSpeech dataset show that our proposed approach outperforms baselines and conventional GAN-based adversarial models.
△ Less
Submitted 10 March, 2021;
originally announced March 2021.
-
Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation
Authors:
Md Akmal Haidar,
Chao Xing,
Mehdi Rezagholizadeh
Abstract:
End-to-end automatic speech recognition (ASR), unlike conventional ASR, does not have modules to learn the semantic representation from speech encoder. Moreover, the higher frame-rate of speech representation prevents the model to learn the semantic representation properly. Therefore, the models that are constructed by the lower frame-rate of speech encoder lead to better performance. For Transfor…
▽ More
End-to-end automatic speech recognition (ASR), unlike conventional ASR, does not have modules to learn the semantic representation from speech encoder. Moreover, the higher frame-rate of speech representation prevents the model to learn the semantic representation properly. Therefore, the models that are constructed by the lower frame-rate of speech encoder lead to better performance. For Transformer-based ASR, the lower frame-rate is not only important for learning better semantic representation but also for reducing the computational complexity due to the self-attention mechanism which has O(n^2) order of complexity in both training and inference. In this paper, we propose a Transformer-based ASR model with the time reduction layer, in which we incorporate time reduction layer inside transformer encoder layers in addition to traditional sub-sampling methods to input features that further reduce the frame-rate. This can help in reducing the computational cost of the self-attention process for training and inference with performance improvement. Moreover, we introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model. Experiments on LibriSpeech datasets show that our proposed methods outperform all other Transformer-based ASR systems. Furthermore, with language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models with just 30 million parameters trained without any external data.
△ Less
Submitted 17 March, 2021;
originally announced March 2021.
-
From Unsupervised Machine Translation To Adversarial Text Generation
Authors:
Ahmad Rashid,
Alan Do-Omri,
Md. Akmal Haidar,
Qun Liu,
Mehdi Rezagholizadeh
Abstract:
We present a self-attention based bilingual adversarial text generator (B-GAN) which can learn to generate text from the encoder representation of an unsupervised neural machine translation system. B-GAN is able to generate a distributed latent space representation which can be paired with an attention based decoder to generate fluent sentences. When trained on an encoder shared between two langua…
▽ More
We present a self-attention based bilingual adversarial text generator (B-GAN) which can learn to generate text from the encoder representation of an unsupervised neural machine translation system. B-GAN is able to generate a distributed latent space representation which can be paired with an attention based decoder to generate fluent sentences. When trained on an encoder shared between two languages and paired with the appropriate decoder, it can generate sentences in either language. B-GAN is trained using a combination of reconstruction loss for auto-encoder, a cross domain loss for translation and a GAN based adversarial loss for text generation. We demonstrate that B-GAN, trained on monolingual corpora only using multiple losses, generates more fluent sentences compared to monolingual baselines while effectively using half the number of parameters.
△ Less
Submitted 10 November, 2020;
originally announced November 2020.
-
A Simplified Fully Quantized Transformer for End-to-end Speech Recognition
Authors:
Alex Bie,
Bharat Venkitesh,
Joao Monteiro,
Md. Akmal Haidar,
Mehdi Rezagholizadeh
Abstract:
While significant improvements have been made in recent years in terms of end-to-end automatic speech recognition (ASR) performance, such improvements were obtained through the use of very large neural networks, unfit for embedded use on edge devices. That being said, in this paper, we work on simplifying and compressing Transformer-based encoder-decoder architectures for the end-to-end ASR task.…
▽ More
While significant improvements have been made in recent years in terms of end-to-end automatic speech recognition (ASR) performance, such improvements were obtained through the use of very large neural networks, unfit for embedded use on edge devices. That being said, in this paper, we work on simplifying and compressing Transformer-based encoder-decoder architectures for the end-to-end ASR task. We empirically introduce a more compact Speech-Transformer by investigating the impact of discarding particular modules on the performance of the model. Moreover, we evaluate reducing the numerical precision of our network's weights and activations while maintaining the performance of the full-precision model. Our experiments show that we can reduce the number of parameters of the full-precision model and then further compress the model 4x by fully quantizing to 8-bit fixed point precision.
△ Less
Submitted 24 March, 2020; v1 submitted 8 November, 2019;
originally announced November 2019.
-
Improving Word Embedding Factorization for Compression Using Distilled Nonlinear Neural Decomposition
Authors:
Vasileios Lioutas,
Ahmad Rashid,
Krtin Kumar,
Md Akmal Haidar,
Mehdi Rezagholizadeh
Abstract:
Word-embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a lot of memory which poses a challenge for edge deployment. Embedding matrices, typically, contain most of the parameters for language models and about a third for machine translation systems. In this paper, we propose Distilled Embedding, an (input/output) em…
▽ More
Word-embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a lot of memory which poses a challenge for edge deployment. Embedding matrices, typically, contain most of the parameters for language models and about a third for machine translation systems. In this paper, we propose Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition and knowledge distillation. First, we initialize the weights of our decomposed matrices by learning to reconstruct the full pre-trained word-embedding and then fine-tune end-to-end, employing knowledge distillation on the factorized embedding. We conduct extensive experiments with various compression rates on machine translation and language modeling, using different data-sets with a shared word-embedding matrix for both embedding and vocabulary projection matrices. We show that the proposed technique is simple to replicate, with one fixed parameter controlling compression size, has higher BLEU score on translation and lower perplexity on language modeling compared to complex, difficult to tune state-of-the-art methods.
△ Less
Submitted 10 November, 2020; v1 submitted 2 October, 2019;
originally announced October 2019.
-
TextKD-GAN: Text Generation using KnowledgeDistillation and Generative Adversarial Networks
Authors:
Md. Akmal Haidar,
Mehdi Rezagholizadeh
Abstract:
Text generation is of particular interest in many NLP applications such as machine translation, language modeling, and text summarization. Generative adversarial networks (GANs) achieved a remarkable success in high quality image generation in computer vision,and recently, GANs have gained lots of interest from the NLP community as well. However, achieving similar success in NLP would be more chal…
▽ More
Text generation is of particular interest in many NLP applications such as machine translation, language modeling, and text summarization. Generative adversarial networks (GANs) achieved a remarkable success in high quality image generation in computer vision,and recently, GANs have gained lots of interest from the NLP community as well. However, achieving similar success in NLP would be more challenging due to the discrete nature of text. In this work, we introduce a method using knowledge distillation to effectively exploit GAN setup for text generation. We demonstrate how autoencoders (AEs) can be used for providing a continuous representation of sentences, which is a smooth representation that assign non-zero probabilities to more than one word. We distill this representation to train the generator to synthesize similar smooth representations. We perform a number of experiments to validate our idea using different datasets and show that our proposed approach yields better performance in terms of the BLEU score and Jensen-Shannon distance (JSD) measure compared to traditional GAN-based text generation approaches without pre-training.
△ Less
Submitted 23 April, 2019;
originally announced May 2019.
-
Latent Code and Text-based Generative Adversarial Networks for Soft-text Generation
Authors:
Md. Akmal Haidar,
Mehdi Rezagholizadeh,
Alan Do-Omri,
Ahmad Rashid
Abstract:
Text generation with generative adversarial networks (GANs) can be divided into the text-based and code-based categories according to the type of signals used for discrimination. In this work, we introduce a novel text-based approach called Soft-GAN to effectively exploit GAN setup for text generation. We demonstrate how autoencoders (AEs) can be used for providing a continuous representation of s…
▽ More
Text generation with generative adversarial networks (GANs) can be divided into the text-based and code-based categories according to the type of signals used for discrimination. In this work, we introduce a novel text-based approach called Soft-GAN to effectively exploit GAN setup for text generation. We demonstrate how autoencoders (AEs) can be used for providing a continuous representation of sentences, which we will refer to as soft-text. This soft representation will be used in GAN discrimination to synthesize similar soft-texts. We also propose hybrid latent code and text-based GAN (LATEXT-GAN) approaches with one or more discriminators, in which a combination of the latent code and the soft-text is used for GAN discriminations. We perform a number of subjective and objective experiments on two well-known datasets (SNLI and Image COCO) to validate our techniques. We discuss the results using several evaluation metrics and show that the proposed techniques outperform the traditional GAN-based text-generation methods.
△ Less
Submitted 23 April, 2019; v1 submitted 15 April, 2019;
originally announced April 2019.
-
Bilingual-GAN: A Step Towards Parallel Text Generation
Authors:
Ahmad Rashid,
Alan Do-Omri,
Md. Akmal Haidar,
Qun Liu,
Mehdi Rezagholizadeh
Abstract:
Latent space based GAN methods and attention based sequence to sequence models have achieved impressive results in text generation and unsupervised machine translation respectively. Leveraging the two domains, we propose an adversarial latent space based model capable of generating parallel sentences in two languages concurrently and translating bidirectionally. The bilingual generation goal is ac…
▽ More
Latent space based GAN methods and attention based sequence to sequence models have achieved impressive results in text generation and unsupervised machine translation respectively. Leveraging the two domains, we propose an adversarial latent space based model capable of generating parallel sentences in two languages concurrently and translating bidirectionally. The bilingual generation goal is achieved by sampling from the latent space that is shared between both languages. First two denoising autoencoders are trained, with shared encoders and back-translation to enforce a shared latent state between the two languages. The decoder is shared for the two translation directions. Next, a GAN is trained to generate synthetic "code" mimicking the languages' shared latent space. This code is then fed into the decoder to generate text in either language. We perform our experiments on Europarl and Multi30k datasets, on the English-French language pair, and document our performance using both supervised and unsupervised machine translation.
△ Less
Submitted 14 May, 2019; v1 submitted 9 April, 2019;
originally announced April 2019.
-
SALSA-TEXT : self attentive latent space based adversarial text generation
Authors:
Jules Gagnon-Marchand,
Hamed Sadeghi,
Md. Akmal Haidar,
Mehdi Rezagholizadeh
Abstract:
Inspired by the success of self attention mechanism and Transformer architecture in sequence transduction and image generation applications, we propose novel self attention-based architectures to improve the performance of adversarial latent code- based schemes in text generation. Adversarial latent code-based text generation has recently gained a lot of attention due to their promising results. I…
▽ More
Inspired by the success of self attention mechanism and Transformer architecture in sequence transduction and image generation applications, we propose novel self attention-based architectures to improve the performance of adversarial latent code- based schemes in text generation. Adversarial latent code-based text generation has recently gained a lot of attention due to their promising results. In this paper, we take a step to fortify the architectures used in these setups, specifically AAE and ARAE. We benchmark two latent code-based methods (AAE and ARAE) designed based on adversarial setups. In our experiments, the Google sentence compression dataset is utilized to compare our method with these methods using various objective and subjective measures. The experiments demonstrate the proposed (self) attention-based models outperform the state-of-the-art in adversarial code-based text generation.
△ Less
Submitted 8 October, 2018; v1 submitted 28 September, 2018;
originally announced September 2018.