Showing 1–46 of 46 results for author: Ko, T

Searching in archive cs.
  1. arXiv:2407.08103

    cs.CL cs.FL

    Automata-based constraints for language model decoding

    Authors: Terry Koo, Frederick Liu, Luheng He

    Abstract: LMs are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-speci…

    Submitted 11 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted to CoLM 2024

  2. Learning Retrieval Augmentation for Personalized Dialogue Generation

    Authors: Qiushi Huang, Shuai Fu, Xubo Liu, Wenwu Wang, Tom Ko, Yu Zhang, Lilian Tang

    Abstract: Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the perso…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to EMNLP-2023

  3. arXiv:2406.18187

    cs.CL cs.AI cs.LG

    Selective Prompting Tuning for Personalized Conversations with LLMs

    Authors: Qiushi Huang, Xubo Liu, Tom Ko, Bo Wu, Wenwu Wang, Yu Zhang, Lilian Tang

    Abstract: In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 findings

  4. "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output

    Authors: Michael Xieyang Liu, Frederick Liu, Alexander J. Fiannaca, Terry Koo, Lucas Dixon, Michael Terry, Carrie J. Cai

    Abstract: Large language models can produce creative and diverse responses. However, to integrate them into current developer workflows, it is essential to constrain their outputs to follow specific formats or standards. In this work, we surveyed 51 experienced industry professionals to understand the range of scenarios and motivations driving the need for output constraints from a user-centered perspective…

    Submitted 10 April, 2024; originally announced April 2024.

    Journal ref: "We Need Structured Output": Towards User-centered Constraints on LLM Output. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '24), May 11-16, 2024, Honolulu, HI, USA

  5. arXiv:2402.12647

    cs.CV cs.RO

    DiffusionNOCS: Managing Symmetry and Uncertainty in Sim2Real Multi-Modal Category-level Pose Estimation

    Authors: Takuya Ikeda, Sergey Zakharov, Tianyi Ko, Muhammad Zubair Irshad, Robert Lee, Katherine Liu, Rares Ambrus, Koichi Nishiwaki

    Abstract: This paper addresses the challenging problem of category-level pose estimation. Current state-of-the-art methods for this task face challenges when dealing with symmetric objects and when attempting to generalize to new environments solely through synthetic data training. In this work, we address these challenges by proposing a probabilistic model that relies on diffusion to estimate dense canonic…

    Submitted 5 March, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: 8 pages. 9 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  6. arXiv:2312.13585

    cs.CL cs.SD eess.AS

    Speech Translation with Large Language Models: An Industrial Practice

    Authors: Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li

    Abstract: Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long au…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Technical report. 13 pages. Demo: https://speechtranslation.github.io/llm-st/

  7. arXiv:2312.11804

    cs.RO

    Gravity-aware Grasp Generation with Implicit Grasp Mode Selection for Underactuated Hands

    Authors: Tianyi Ko, Takuya Ikeda, Thomas Stewart, Robert Lee, Koichi Nishiwaki

    Abstract: Learning-based grasp detectors typically assume a precision grasp, where each finger only has one contact point, and estimate the grasp probability. In this work, we propose a data generation and learning pipeline that can leverage power grasping, which has more contact points with an enveloping configuration and is robust against both positioning error and force disturbance. To train a grasp dete…

    Submitted 28 February, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  8. arXiv:2309.00169

    eess.AS cs.LG cs.SD

    RepCodec: A Speech Representation Codec for Speech Tokenization

    Authors: Zhichao Huang, Chutong Meng, Tom Ko

    Abstract: With the recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role in injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing overall performance. To improve the performance of these discrete speech tokens, we present RepCodec, a novel speech representation codec for semantic speech token…

    Submitted 6 June, 2024; v1 submitted 31 August, 2023; originally announced September 2023.

  9. arXiv:2306.11646

    cs.CL eess.AS

    Recent Advances in Direct Speech-to-text Translation

    Authors: Chen Xu, Rong Ye, Qianqian Dong, Chengqi Zhao, Tom Ko, Mingxuan Wang, Tong Xiao, Jingbo Zhu

    Abstract: Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and applicati…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: An expanded version of the paper accepted by IJCAI2023 survey track

  10. arXiv:2306.10493

    cs.SD cs.CL eess.AS

    MOSPC: MOS Prediction Based on Pairwise Comparison

    Authors: Kexin Wang, Yunlong Zhao, Qianqian Dong, Tom Ko, Mingxuan Wang

    Abstract: As a subjective metric to evaluate the quality of synthesized speech, mean opinion score (MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. A MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech…

    Submitted 18 June, 2023; originally announced June 2023.

  11. arXiv:2306.02982

    cs.CL eess.AS

    PolyVoice: Language Models for Speech to Speech Translation

    Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

    Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) systems. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt…

    Submitted 13 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

  12. arXiv:2305.17358

    cs.CL

    CTC-based Non-autoregressive Speech Translation

    Authors: Chen Xu, Xiaoqian Liu, Xiaowen Liu, Qingxuan Sun, Yuhao Zhang, Murun Yang, Qianqian Dong, Tom Ko, Mingxuan Wang, Tong Xiao, Anxiang Ma, Jingbo Zhu

    Abstract: Combining end-to-end speech translation (ST) and non-autoregressive (NAR) generation is promising in language and speech processing for their advantages of less error propagation and low latency. In this paper, we investigate the potential of connectionist temporal classification (CTC) for non-autoregressive speech translation (NAST). In particular, we develop a model consisting of two encoders th…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Main Conference

  13. arXiv:2305.11411

    cs.CL cs.SD eess.AS

    DUB: Discrete Unit Back-translation for Speech Translation

    Authors: Dong Zhang, Rong Ye, Tom Ko, Mingxuan Wang, Yaqian Zhou

    Abstract: How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key point is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, the approach of representing speech with unsupervised discrete units yields a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to a…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of ACL 2023

  14. arXiv:2303.17395

    eess.AS cs.CL cs.MM cs.SD

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

    Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: 12 pages

  15. Finding Heterophilic Neighbors via Confidence-based Subgraph Matching for Semi-supervised Node Classification

    Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Chong-Kwon Kim

    Abstract: Graph Neural Networks (GNNs) have proven to be powerful in many graph-based applications. However, they fail to generalize well under heterophilic setups, where neighbor nodes have different labels. To address this challenge, we employ a confidence ratio as a hyper-parameter, assuming that some of the edges are disassortative (heterophilic). Here, we propose a two-phased algorithm. Firstly, we det…

    Submitted 12 April, 2023; v1 submitted 19 February, 2023; originally announced February 2023.

    Comments: Proceedings of the 31st ACM International Conference on Information & Knowledge Management

  16. arXiv:2301.08918

    cs.LG cs.SI

    Improving Signed Propagation of Graph Neural Network Under Multiple Classes

    Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Chong-Kwon Kim

    Abstract: Message-passing Graph Neural Networks (GNNs), which collect information from adjacent nodes, achieve dismal performance on heterophilic graphs. Various schemes have been proposed to solve this problem, and propagating signed information on heterophilic edges has gained great attention. Recently, some works provided theoretical analysis that signed propagation always leads to performance improvement…

    Submitted 18 June, 2024; v1 submitted 21 January, 2023; originally announced January 2023.

  17. arXiv:2301.05163

    cs.LG cs.AI

    Signed Directed Graph Contrastive Learning with Laplacian Augmentation

    Authors: Taewook Ko, Yoonhyuk Choi, Chong-Kwon Kim

    Abstract: Graph contrastive learning has become a powerful technique for several graph mining tasks. It learns discriminative representation from different perspectives of augmented graphs. Ubiquitous in our daily life, signed-directed graphs are the most complex and tricky to analyze among various graph types. That is why signed-directed graph contrastive learning has not been studied much yet, while there…

    Submitted 12 January, 2023; originally announced January 2023.

    Comments: Pre-prints

  18. arXiv:2212.03657

    cs.CL cs.SD eess.AS

    M3ST: Mix at Three Levels for Speech Translation

    Authors: Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou

    Abstract: How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  19. arXiv:2211.15398

    cs.CV cs.LG

    Leveraging per Image-Token Consistency for Vision-Language Pre-training

    Authors: Yunhao Gou, Tom Ko, Hansi Yang, James Kwok, Yu Zhang, Mingxuan Wang

    Abstract: Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable amount of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-uti…

    Submitted 2 September, 2023; v1 submitted 20 November, 2022; originally announced November 2022.

    Comments: Accepted by CVPR 2023

  20. arXiv:2211.15081

    cs.LG cs.AI

    Perturb Initial Features: Generalization of Neural Networks Under Sparse Features for Semi-supervised Node Classification

    Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Chong-Kwon Kim

    Abstract: Graph neural networks (GNNs) are commonly used in semi-supervised settings. Previous research has primarily focused on finding appropriate graph filters (e.g. aggregation methods) to perform well on both homophilic and heterophilic graphs. While these methods are effective, they can still suffer from the sparsity of node features, where the initial data contain few non-zero elements. This can lead…

    Submitted 28 May, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

  21. arXiv:2210.16428

    eess.AS cs.AI cs.MM cs.SD

    Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

    Abstract: Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sound…

    Submitted 28 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: INTERSPEECH 2023

  22. Personalized Dialogue Generation with Persona-Adaptive Attention

    Authors: Qiushi Huang, Yu Zhang, Tom Ko, Xubo Liu, Bo Wu, Wenwu Wang, Lilian Tang

    Abstract: Persona-based dialogue systems aim to generate consistent responses based on historical context and predefined persona. Unlike conventional dialogue generation, the persona-based dialogue needs to consider both dialogue context and persona, posing a challenge for coherent training. Specifically, this requires a delicate weight balance between context and persona. To achieve that, in this paper, we…

    Submitted 9 January, 2024; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: 8 pages, 3 figures. Accepted by AAAI-2023

  23. arXiv:2210.04062

    cs.SD eess.AS

    CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

    Authors: Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li

    Abstract: Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to a sequence of discrete codes, and perform code representation learning, where we predict the code representations based on a masked view of the original speech…

    Submitted 5 July, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

    Comments: Accepted by Interspeech 2023

  24. arXiv:2208.11511

    cs.LG cs.AI

    A Graph Convolution for Signed Directed Graphs

    Authors: Taewook Ko, Chong-Kwon Kim

    Abstract: A signed directed graph is a graph with sign and direction information on the edges. Even though signed directed graphs are more informative than unsigned or undirected graphs, they are more complicated to analyze and have received less research attention. This paper investigates a spectral graph convolution model to fully utilize the information embedded in signed directed edges. We propose a nov…

    Submitted 16 February, 2023; v1 submitted 22 August, 2022; originally announced August 2022.

    Comments: Preprint version

  25. arXiv:2208.02189

    eess.AS cs.CL cs.LG cs.SD

    A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

    Authors: Qibing Bai, Tom Ko, Yu Zhang

    Abstract: In human speech, the attitude of a speaker cannot be fully expressed only by the textual content. It has to come along with the intonation. Declarative questions are commonly used in daily Cantonese conversations, and they are usually uttered with rising intonation. Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences due to the loss of…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: Accepted by INTERSPEECH 2022

  26. arXiv:2205.12756

    cs.RO

    Development of a Stereo-Vision Based High-Throughput Robotic System for Mouse Tail Vein Injection

    Authors: Tianyi Ko, Koichi Nishiwaki, Koji Terada, Yusuke Tanaka, Shun Mitsumata, Ryuichi Katagiri, Taketo Junko, Naoshi Horiba, Hideyoshi Igata, Kazue Mizuno

    Abstract: In this paper, we present a robotic device for mouse tail vein injection. We propose a mouse holding mechanism to realize vein injection without anesthetizing the mouse, which consists of a tourniquet, vacuum port, and adaptive tail-end fixture. The position of the target vein in 3D space is reconstructed from a high-resolution stereo vision. The vein is detected by a simple but robust vein line d…

    Submitted 25 May, 2022; originally announced May 2022.

    Comments: accepted to ICRA2022 (7 pages, 11 figures, 2 tables)

  27. arXiv:2205.11772

    cs.CV

    Multi-Augmentation for Efficient Visual Representation Learning for Self-supervised Pre-training

    Authors: Van-Nhiem Tran, Chi-En Huang, Shen-Hsuan Liu, Kai-Lin Yang, Timothy Ko, Yung-Hui Li

    Abstract: In recent years, self-supervised learning has been studied to deal with the limitation of available labeled-dataset. Among the major components of self-supervised learning, the data augmentation pipeline is one key factor in enhancing the resulting performance. However, most researchers manually designed the augmentation pipeline, and the limited collections of transformation may cause the lack of…

    Submitted 24 May, 2022; originally announced May 2022.

  28. arXiv:2205.08993

    cs.CL eess.AS

    Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

    Authors: Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang

    Abstract: Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently. The task is very challenging due to data scarcity and complex speech-to-speech mapping. In this paper, we report our recent achievements in S2ST. Firstly, we build an S2ST Transformer baseline which outperforms the original Translatotron. Secondly, we utilize the external data by pseudo-labeling and obtain a new…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022

  29. arXiv:2204.03939

    cs.CL cs.SD eess.AS

    GigaST: A 10,000-hour Pseudo Speech Translation Corpus

    Authors: Rong Ye, Chengqi Zhao, Tom Ko, Chutong Meng, Tao Wang, Mingxuan Wang, Jun Cao

    Abstract: This paper introduces GigaST, a large-scale pseudo speech translation (ST) corpus. We create the corpus by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set is translated by humans. ST models trained with an addition of our corpus obtain new state-of-the-art results on the MuST-C…

    Submitted 6 June, 2023; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: Accepted at Interspeech 2023. GigaST dataset is available at https://st-benchmark.github.io/resources/GigaST

  30. arXiv:2203.17113

    cs.SD cs.LG eess.AS

    Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

    Authors: Junyi Ao, Ziqiang Zhang, Long Zhou, Shujie Liu, Haizhou Li, Tom Ko, Lirong Dai, Jinyu Li, Yao Qian, Furu Wei

    Abstract: This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One is to predict the pseudo codes via masked language mode…

    Submitted 20 June, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted by Interspeech 2022

  31. arXiv:2203.15610

    eess.AS cs.CL cs.LG cs.SD

    LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

    Authors: Rui Wang, Qibing Bai, Junyi Ao, Long Zhou, Zhixiang Xiong, Zhihua Wei, Yu Zhang, Tom Ko, Haizhou Li

    Abstract: Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pr…

    Submitted 18 June, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, accepted to Interspeech 2022

  32. arXiv:2203.10973

    cs.LG math.NA stat.ML

    A Local Convergence Theory for the Stochastic Gradient Descent Method in Non-Convex Optimization With Non-isolated Local Minima

    Authors: Taehee Ko, Xiantao Li

    Abstract: Loss functions with non-isolated minima have emerged in several machine learning problems, creating a gap between theory and practice. In this paper, we formulate a new type of local convexity condition that is suitable to describe the behavior of loss functions near non-isolated minima. We show that such a condition is general enough to encompass many existing conditions. In addition, we study the l…

    Submitted 30 May, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

  33. Review-Based Domain Disentanglement without Duplicate Users or Contexts for Cross-Domain Recommendation

    Authors: Yoonhyuk Choi, Jiho Choi, Taewook Ko, Hyungho Byun, Chong-Kwon Kim

    Abstract: A cross-domain recommendation has shown promising results in solving data-sparsity and cold-start problems. Despite such progress, existing methods focus on domain-shareable information (overlapped users or same contexts) for a knowledge transfer, and they fail to generalize well without such requirements. To deal with these problems, we suggest utilizing review texts that are general to most e-co…

    Submitted 12 April, 2023; v1 submitted 25 October, 2021; originally announced October 2021.

    Comments: Proceedings of the 31st ACM International Conference on Information & Knowledge Management

  34. arXiv:2110.07205

    eess.AS cs.CL cs.LG cs.SD

    SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

    Authors: Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei

    Abstract: Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After prepro…

    Submitted 24 May, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Accepted by ACL 2022 main conference

  35. arXiv:2110.05036

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Multi-View Self-Attention Based Transformer for Speaker Recognition

    Authors: Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing Li, Yu Zhang

    Abstract: Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms are originally designed for modeling textual sequence without considering the characteristics of speech and speaker modeling. Besides, different Tr…

    Submitted 27 January, 2022; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Paper to appear at ICASSP 2022

  36. arXiv:2108.02752

    eess.AS cs.SD

    An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

    Authors: Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related task or a large in-domain dataset is introduced t…

    Submitted 5 August, 2021; originally announced August 2021.

    Comments: 5 pages, 1 figure, submitted to DCASE 2021 workshop

  37. arXiv:2107.09990

    eess.AS cs.AI cs.SD

    CL4AC: A Contrastive Loss for Audio Captioning

    Authors: Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang

    Abstract: Automated Audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has received increasing interest in the community. The existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded in…

    Submitted 22 November, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

    Comments: The first two authors contributed equally, 5 pages, 3 figures, accepted by DCASE2021 Workshop

  38. Token-Level Supervised Contrastive Learning for Punctuation Restoration

    Authors: Qiushi Huang, Tom Ko, H Lilian Tang, Xubo Liu, Bo Wu

    Abstract: Punctuation is critical in understanding natural language text. Currently, most automatic speech recognition (ASR) systems do not generate punctuation, which affects the performance of downstream tasks, such as intent detection and slot filling. This gives rise to the need for punctuation restoration. Recent work in punctuation restoration heavily utilizes pre-trained language models without consi…

    Submitted 23 September, 2021; v1 submitted 19 July, 2021; originally announced July 2021.

    Comments: 5 pages, 3 figures

  39. arXiv:2104.03815

    cs.CL cs.SD eess.AS

    Exploring Machine Speech Chain for Domain Adaptation and Few-Shot Speaker Adaptation

    Authors: Fengpeng Yue, Yan Deng, Lei He, Tom Ko

    Abstract: Machine Speech Chain, which integrates both end-to-end (E2E) automatic speech recognition (ASR) and text-to-speech (TTS) into one circle for joint training, has been proven to be effective in data augmentation by leveraging large amounts of unpaired data. In this paper, we explore the TTS->ASR pipeline in speech chain to do domain adaptation for both neural TTS and E2E ASR models, with only text d…

    Submitted 8 April, 2021; originally announced April 2021.

  40. arXiv:2104.00513

    cs.SD cs.AI

    Auto-KWS 2021 Challenge: Task, Datasets, and Baselines

    Authors: Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, Lei Xie

    Abstract: Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task. Compared with other keyword spotting tasks, Auto-KWS challenge has the following three characteristics: 1) The challenge focuses on the problem of customized keyword spotting, where the target device can only be awakened by an e…

    Submitted 31 March, 2021; originally announced April 2021.

    Comments: 5 pages, 2 figures

  41. arXiv:2010.13130

    cs.AI

    AutoSpeech 2020: The Second Automated Machine Learning Challenge for Speech Classification

    Authors: Jingsong Wang, Tom Ko, Zhen Xu, Xiawei Guo, Souxiang Liu, Wei-Wei Tu, Lei Xie

    Abstract: The AutoSpeech challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to speech processing tasks. These tasks, which cover a large variety of domains, will be shown to the automated system in a random order. Each time when the tasks are switched, the information of the new task will be hinted with its corresponding training set. Thus,…

    Submitted 25 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures, Details about AutoSpeech 2020 Challenge

  42. arXiv:2009.13735  [pdf, ps, other]

    cs.CV

    MetaMix: Improved Meta-Learning with Interpolation-based Consistency Regularization

    Authors: Yangbin Chen, Yun Ma, Tom Ko, Jianping Wang, Qing Li

    Abstract: Model-Agnostic Meta-Learning (MAML) and its variants are popular few-shot classification methods. They train an initializer across a variety of sampled learning tasks (also known as episodes) such that the initialized model can adapt quickly to new tasks. However, current MAML-based algorithms have limitations in forming generalizable decision boundaries. In this paper, we propose an approach call…

    Submitted 10 October, 2020; v1 submitted 28 September, 2020; originally announced September 2020.

    Comments: 8 pages, 3 figures, 3 tables. Accepted by 25th International Conference on Pattern Recognition (ICPR) 2020

  43. A Tendon-driven Robot Gripper with Passively Switchable Underactuated Surface and its Physics Simulation Based Parameter Optimization

    Authors: Tianyi Ko

    Abstract: In this paper, we propose a single-actuator gripper that can lift thin objects lying on a flat surface, in addition to the ability as a standard parallel gripper. The key is a crawler on the fingertip, which is underactuated together with other finger joints and switched with a passive and spring-loaded mechanism. While the idea of crawling finger is not a new one, this paper contributes to realiz…

    Submitted 13 August, 2020; originally announced August 2020.

    Journal ref: IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5002-5009, Oct. 2020

  44. arXiv:1812.10233  [pdf, ps, other]

    cs.CL cs.IR

    An Investigation of Few-Shot Learning in Spoken Term Classification

    Authors: Yangbin Chen, Tom Ko, Lifeng Shang, Xiao Chen, Xin Jiang, Qing Li

    Abstract: In this paper, we investigate the feasibility of applying few-shot learning algorithms to a speech task. We formulate a user-defined scenario of spoken term classification as a few-shot learning problem. In most few-shot learning studies, it is assumed that all the N classes are new in an N-way problem. We suggest that this assumption can be relaxed and define an N+M-way problem where N and M are th…

    Submitted 14 September, 2020; v1 submitted 26 December, 2018; originally announced December 2018.

    Comments: Accepted by INTERSPEECH 2020

  45. arXiv:1703.04929  [pdf, ps, other]

    cs.CL

    SyntaxNet Models for the CoNLL 2017 Shared Task

    Authors: Chris Alberti, Daniel Andor, Ivan Bogatyy, Michael Collins, Dan Gillick, Lingpeng Kong, Terry Koo, Ji Ma, Mark Omernick, Slav Petrov, Chayut Thanapirom, Zora Tung, David Weiss

    Abstract: We describe a baseline dependency parsing system for the CoNLL2017 Shared Task. This system, which we call "ParseySaurus," uses the DRAGNN framework [Kong et al, 2017] to combine transition-based recurrent parsing and tagging with character-based word representations. On the v1.3 Universal Dependencies Treebanks, the new system outperforms the publicly available, state-of-the-art "Parsey's Cousins"…

    Submitted 15 March, 2017; originally announced March 2017.

    Comments: Tech report

  46. arXiv:1412.7449  [pdf, other]

    cs.CL cs.LG stat.ML

    Grammar as a Foreign Language

    Authors: Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, Geoffrey Hinton

    Abstract: Syntactic constituency parsing is a fundamental problem in natural language processing and has been the subject of intensive research and engineering for decades. As a result, the most accurate parsers are domain specific, complex, and inefficient. In this paper we show that the domain agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used…

    Submitted 9 June, 2015; v1 submitted 23 December, 2014; originally announced December 2014.