-
MaskCycleGAN-based Whisper to Normal Speech Conversion
Authors:
K. Rohith Gupta,
K. Ramnath,
S. Johanan Joysingh,
P. Vijayalakshmi,
T. Nagarajan
Abstract:
Whisper to normal speech conversion is an active area of research. Various architectures based on generative adversarial networks have been proposed in the recent past. Especially, recent study shows that MaskCycleGAN, which is a mask guided, and cyclic consistency keeping, generative adversarial network, performs really well for voice conversion from spectrogram representations. In the current wo…
▽ More
Whisper to normal speech conversion is an active area of research. Various architectures based on generative adversarial networks have been proposed in the recent past. Especially, recent study shows that MaskCycleGAN, which is a mask guided, and cyclic consistency keeping, generative adversarial network, performs really well for voice conversion from spectrogram representations. In the current work we present a MaskCycleGAN approach for the conversion of whispered speech to normal speech. We find that tuning the mask parameters, and pre-processing the signal with a voice activity detector provides superior performance when compared to the existing approach. The wTIMIT dataset is used for evaluation. Objective metrics such as PESQ and G-Loss are used to evaluate the converted speech, along with subjective evaluation using mean opinion score. The results show that the proposed approach offers considerable benefits.
△ Less
Submitted 27 August, 2024;
originally announced August 2024.
-
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
Authors:
Zhichao Wang,
Bin Bi,
Shiva Kumar Pentyala,
Kiran Ramnath,
Sougata Chaudhuri,
Shubham Mehrotra,
Zixu,
Zhu,
Xiang-Bo Mao,
Sitaram Asur,
Na,
Cheng
Abstract:
With advancements in self-supervised learning, the availability of trillions tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired re…
▽ More
With advancements in self-supervised learning, the availability of trillions tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectation. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.
△ Less
Submitted 23 July, 2024;
originally announced July 2024.
-
PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
Authors:
Shiva Kumar Pentyala,
Zhichao Wang,
Bin Bi,
Kiran Ramnath,
Xiang-Bo Mao,
Regunathan Radhakrishnan,
Sitaram Asur,
Na,
Cheng
Abstract:
Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to alignment tax that degrades the LLM performance.
This paper introduces PAFT, a new PArallel training pa…
▽ More
Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to alignment tax that degrades the LLM performance.
This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO and ORPO, etc.) with the same pre-trained model on respective datasets. The model produced by SFT and the model from preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals important findings that preference alignment like DPO naturally results in a sparse model while SFT leads to a natural dense model which needs to be sparsified for effective model merging. This paper introduces an effective interference resolution which reduces the redundancy by sparsifying the delta parameters. The LLM resulted from the new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
△ Less
Submitted 25 June, 2024;
originally announced June 2024.
-
LatentDR: Improving Model Generalization Through Sample-Aware Latent Degradation and Restoration
Authors:
Ran Liu,
Sahil Khose,
Jingyun Xiao,
Lakshmi Sathidevi,
Keerthan Ramnath,
Zsolt Kira,
Eva L. Dyer
Abstract:
Despite significant advances in deep learning, models often struggle to generalize well to new, unseen domains, especially when training data is limited. To address this challenge, we propose a novel approach for distribution-aware latent augmentation that leverages the relationships across samples to guide the augmentation procedure. Our approach first degrades the samples stochastically in the l…
▽ More
Despite significant advances in deep learning, models often struggle to generalize well to new, unseen domains, especially when training data is limited. To address this challenge, we propose a novel approach for distribution-aware latent augmentation that leverages the relationships across samples to guide the augmentation procedure. Our approach first degrades the samples stochastically in the latent space, mapping them to augmented labels, and then restores the samples from their corrupted versions during training. This process confuses the classifier in the degradation step and restores the overall class distribution of the original samples, promoting diverse intra-class/cross-domain variability. We extensively evaluate our approach on a diverse set of datasets and tasks, including domain generalization benchmarks and medical imaging datasets with strong domain shift, where we show our approach achieves significant improvements over existing methods for latent space augmentation. We further show that our method can be flexibly adapted to long-tail recognition tasks, demonstrating its versatility in building more generalizable models. Code is available at https://github.com/nerdslab/LatentDR.
△ Less
Submitted 28 August, 2023;
originally announced August 2023.
-
Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings
Authors:
Kiran Ramnath,
Mark Hasegawa-Johnson
Abstract:
Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA-system to include facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e., not all non-existent facts are always incorrect. Therefore, being able to reason over incomplete KGs for QA is a critical requirem…
▽ More
Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA-system to include facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e., not all non-existent facts are always incorrect. Therefore, being able to reason over incomplete KGs for QA is a critical requirement in real-world applications that has not been addressed extensively in the literature. We develop a novel QA architecture that allows us to reason over incomplete KGs, something current FVQA state-of-the-art (SOTA) approaches lack due to their critical reliance on fact retrieval. We use KG Embeddings, a technique widely used for KG completion, for the downstream task of FVQA. We also employ a new image representation technique we call 'Image-as-Knowledge' to enable this capability, alongside a simple one-step CoAttention mechanism to attend to text and image during QA. Our FVQA architecture is faster during inference time, being O(m), as opposed to existing FVQA SOTA methods which are O(N log N), where m = number of vertices, N = number of edges = O(m^2). KG embeddings are shown to hold complementary information to word embeddings: a combination of both metrics permits performance comparable to SOTA methods in the standard answer retrieval task, and significantly better (26% absolute) in the proposed missing-edge reasoning task.
△ Less
Submitted 18 June, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.