Search | arXiv e-print repository

Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement

Authors: Longshen Ou, Jingwei Zhao, Ziyu Wang, Gus Xia, Ye Wang

Abstract: Large language models have shown significant capabilities across various domains, including symbolic music generation. However, leveraging these pre-trained models for controllable music arrangement tasks, each requiring different forms of musical information as control, remains a novel challenge. In this paper, we propose a unified sequence-to-sequence framework that enables the fine-tuning of a symbolic music language model for multiple multi-track arrangement tasks, including band arrangement, piano reduction, drum arrangement, and voice separation. Our experiments demonstrate that the proposed approach consistently achieves higher musical quality compared to task-specific baselines across all four tasks. Furthermore, through additional experiments on probing analysis, we show the pre-training phase equips the model with essential knowledge to understand musical conditions, which is hard to acquire solely through task-specific fine-tuning.

Submitted 27 August, 2024; originally announced August 2024.

Comments: Submitted to AAAI 2025

arXiv:2408.04249 [pdf, other]

InstantStyleGaussian: Efficient Art Style Transfer with 3D Gaussian Splatting

Authors: Xin-Yi Yu, Jun-Xin Yu, Li-Bo Zhou, Yan Wei, Lin-Lin Ou

Abstract: We present InstantStyleGaussian, an innovative 3D style transfer method based on the 3D Gaussian Splatting (3DGS) scene representation. By inputting a target-style image, it quickly generates new 3D GS scenes. Our method operates on pre-reconstructed GS scenes, combining diffusion models with an improved iterative dataset update strategy. It utilizes diffusion models to generate target style images, adds these new images to the training dataset, and uses this dataset to iteratively update and optimize the GS scenes, significantly accelerating the style editing process while ensuring the quality of the generated scenes. Extensive experimental results demonstrate that our method ensures high-quality stylized scenes while offering significant advantages in style transfer speed and consistency.

Submitted 26 August, 2024; v1 submitted 8 August, 2024; originally announced August 2024.

arXiv:2408.00294 [pdf, other]

RDP: Ranked Differential Privacy for Facial Feature Protection in Multiscale Sparsified Subspace

Authors: Lu Ou, Shaolin Liao, Shihui Gao, Guandong Huang, Zheng Qi

Abstract: With the widespread sharing of personal face images in applications' public databases, face recognition systems face a real threat of being breached by potential adversaries who are able to access users' face images and use them to intrude the face recognition systems. In this paper, we propose a novel privacy protection method in the multiscale sparsified feature subspaces to protect sensitive facial features, by taking care of the influence or weight ranked feature coefficients on the privacy budget, named "Ranked Differential Privacy (RDP)". After the multiscale feature decomposition, the lightweight Laplacian noise is added to the dimension-reduced sparsified feature coefficients according to the geometric superposition method. Then, we rigorously prove that the RDP satisfies Differential Privacy. After that, the nonlinear Lagrange Multiplier (LM) method is formulated for the constraint optimization problem of maximizing the utility of the visualization quality protected face images with sanitizing noise, under a given facial features privacy budget. Then, two methods are proposed to solve the nonlinear LM problem and obtain the optimal noise scale parameters: 1) the analytical Normalization Approximation (NA) method with identical average noise scale parameter for real-time online applications; and 2) the LM optimization Gradient Descent (LMGD) numerical method to obtain the nonlinear solution through iterative updating for more accurate offline applications. Experimental results on two real-world datasets show that our proposed RDP outperforms other state-of-the-art methods: at a privacy budget of 0.2, the PSNR (Peak Signal-to-Noise Ratio) of the RDP is about 10 dB higher than (10 times as high as) the highest PSNR of all compared methods.
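As a rough aid for reading the PSNR figure in this abstract: a decibel difference maps to a linear ratio via 10^(dB/10), so a 10 dB gain corresponds to a tenfold ratio (equivalently, a tenfold lower mean squared error at the same peak value). The sketch below is illustrative only; the function names and the sample MSE values are hypothetical and are not taken from the paper.

```python
import math


def db_to_ratio(delta_db: float) -> float:
    """Convert a decibel difference to a linear power ratio."""
    return 10 ** (delta_db / 10)


def psnr_db(peak: float, mse: float) -> float:
    """PSNR in dB for a given peak signal value and mean squared error."""
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical MSE values, chosen only to illustrate a 10x gap in error.
baseline = psnr_db(peak=255.0, mse=100.0)
improved = psnr_db(peak=255.0, mse=10.0)
gain_db = improved - baseline  # difference in PSNR, in dB
```

Here `gain_db` comes out at roughly 10 dB, and `db_to_ratio(gain_db)` recovers the tenfold ratio between the two MSE values.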

Submitted 1 August, 2024; originally announced August 2024.

Comments: 13 pages, 6 figures

arXiv:2403.01214 [pdf, other]

Boosting Box-supervised Instance Segmentation with Pseudo Depth

Authors: Xinyi Yu, Ling Yan, Pengtao Jiang, Hao Chen, Bo Li, Lin Yuanbo Wu, Linlin Ou

Abstract: The realm of Weakly Supervised Instance Segmentation (WSIS) under box supervision has garnered substantial attention, showcasing remarkable advancements in recent years. However, the limitations of box supervision become apparent in its inability to furnish effective information for distinguishing foreground from background within the specified target box. This research addresses this challenge by introducing pseudo-depth maps into the training process of the instance segmentation network, thereby boosting its performance by capturing depth differences between instances. These pseudo-depth maps are generated using a readily available depth predictor and are not necessary during the inference stage. To enable the network to discern depth features when predicting masks, we integrate a depth prediction layer into the mask prediction head. This innovative approach empowers the network to simultaneously predict masks and depth, enhancing its ability to capture nuanced depth-related information during the instance segmentation process. We further utilize the mask generated in the training process as supervision to distinguish the foreground from the background. When selecting the best mask for each box through the Hungarian algorithm, we use depth consistency as one calculation cost item. The proposed method achieves significant improvements on the Cityscapes and COCO datasets.

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2309.09739 [pdf, other]

Improving Neural Indoor Surface Reconstruction with Mask-Guided Adaptive Consistency Constraints

Authors: Xinyi Yu, Liqin Lu, Jintao Rong, Guangkai Xu, Linlin Ou

Abstract: 3D scene reconstruction from 2D images has been a long-standing task. Instead of estimating per-frame depth maps and fusing them in 3D, recent research leverages the neural implicit surface as a unified representation for 3D reconstruction. Equipped with data-driven pre-trained geometric cues, these methods have demonstrated promising performance. However, inaccurate prior estimation, which is usually inevitable, can lead to suboptimal reconstruction quality, particularly in some geometrically complex regions. In this paper, we propose a two-stage training process, decouple view-dependent and view-independent colors, and leverage two novel consistency constraints to enhance detail reconstruction performance without requiring extra priors. Additionally, we introduce an essential mask scheme to adaptively influence the selection of supervision constraints, thereby improving performance in a self-supervised paradigm. Experiments on synthetic and real-world datasets show the capability of reducing the interference from prior estimation errors and achieving high-quality scene reconstruction with rich geometric details.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2307.08300 [pdf, other]

ShiftNAS: Improving One-shot NAS via Probability Shift

Authors: Mingyang Zhang, Xinyi Yu, Haodong Zhao, Linlin Ou

Abstract: One-shot Neural architecture search (One-shot NAS) has been proposed as a time-efficient approach to obtain optimal subnet architectures and weights under different complexity cases by training only once. However, the subnet performance obtained by weight sharing is often inferior to the performance achieved by retraining. In this paper, we investigate the performance gap and attribute it to the use of uniform sampling, which is a common approach in supernet training. Uniform sampling concentrates training resources on subnets with intermediate computational resources, which are sampled with high probability. However, subnets with different complexity regions require different optimal training strategies for optimal performance. To address the problem of uniform sampling, we propose ShiftNAS, a method that can adjust the sampling probability based on the complexity of subnets. We achieve this by evaluating the performance variation of subnets with different complexity and designing an architecture generator that can accurately and efficiently provide subnets with the desired complexity. Both the sampling probability and the architecture generator can be trained end-to-end in a gradient-based manner. With ShiftNAS, we can directly obtain the optimal model architecture and parameters for a given computational complexity. We evaluate our approach on multiple visual network models, including convolutional neural networks (CNNs) and vision transformers (ViTs), and demonstrate that ShiftNAS is model-agnostic. Experimental results on ImageNet show that ShiftNAS can improve the performance of one-shot NAS without additional consumption. Source codes are available at https://github.com/bestfleer/ShiftNAS.

Submitted 17 July, 2023; originally announced July 2023.

Comments: Accepted by ICCV 2023

arXiv:2307.02146 [pdf, other]

LOAF-M2L: Joint Learning of Wording and Formatting for Singable Melody-to-Lyric Generation

Authors: Longshen Ou, Xichu Ma, Ye Wang

Abstract: Despite previous efforts in melody-to-lyric generation research, there is still a significant compatibility gap between generated lyrics and melodies, negatively impacting the singability of the outputs. This paper bridges the singability gap with a novel approach to generating singable lyrics by jointly Learning wOrding And Formatting during Melody-to-Lyric training. After general-domain pretraining, our proposed model acquires length awareness first from a large text-only lyric corpus. Then, we introduce a new objective informed by musicological research on the relationship between melody and lyrics during melody-to-lyric training, which enables the model to learn the fine-grained format requirements of the melody. Our model achieves 3.75% and 21.44% absolute accuracy gains in the outputs' number-of-line and syllable-per-line requirements compared to naive fine-tuning, without sacrificing text fluency. Furthermore, our model demonstrates a 63.92% and 74.18% relative improvement of music-lyric compatibility and overall quality in the subjective evaluation, compared to the state-of-the-art melody-to-lyric generation model, highlighting the significance of formatting learning.

Submitted 19 July, 2024; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: An extension of our previous work arXiv:2305.16816 [cs.CL]

arXiv:2306.02243 [pdf, other]

Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification

Authors: Jintao Rong, Hao Chen, Tianxiao Chen, Linlin Ou, Xinyi Yu, Yifan Liu

Abstract: Prompt learning has become a popular approach for adapting large vision-language models, such as CLIP, to downstream tasks. Typically, prompt learning relies on a fixed prompt token or an input-conditional token to fit a small amount of data under full supervision. While this paradigm can generalize to a certain range of unseen classes, it may struggle when the domain gap increases, such as in fine-grained classification and satellite image segmentation. To address this limitation, we propose Retrieval-enhanced Prompt learning (RePrompt), which introduces retrieval mechanisms to cache the knowledge representations from downstream tasks. We first construct a retrieval database from training examples, or from external examples when available. We then integrate this retrieval-enhanced mechanism into various stages of a simple prompt learning baseline. By referencing similar samples in the training set, the enhanced model is better able to adapt to new tasks with few samples. Our extensive experiments over 15 vision datasets, including 11 downstream tasks in the few-shot setting and 4 domain generalization benchmarks, demonstrate that RePrompt achieves considerably improved performance. Our proposed approach provides a promising solution to the challenges faced by prompt learning when the domain gap increases. The code and models will be available.

Submitted 18 June, 2024; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2305.18403 [pdf, other]

LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

Authors: Mingyang Zhang, Hao Chen, Chunhua Shen, Zhen Yang, Linlin Ou, Xinyi Yu, Bohan Zhuang

Abstract: Large Language Models (LLMs), such as LLaMA and T5, have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaptation (LoRA) has emerged to cheaply fine-tune these LLMs on downstream tasks, their deployment is still hindered by the vast model scale and computational costs. Post-training model pruning offers a way to compress LLMs. However, the current pruning methods designed for LLMs are not compatible with LoRA. This is due to their utilization of unstructured pruning on LLMs, impeding the merging of LoRA weights, or their dependence on the gradients of pre-trained weights to guide pruning, which can impose significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion, which uses the weights and gradients of LoRA, rather than the gradients of pre-trained weights for importance estimation. We subsequently integrate this criterion into an iterative pruning process, effectively removing redundant channels and heads. Extensive experimental results demonstrate the superior performance of our LoRAPrune over existing approaches on the LLaMA series models. At a 50% compression rate, LoRAPrune demonstrates superior performance over LLM-Pruner, achieving a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%. Besides, LoRAPrune also matches semi-structural pruning across multiple LLMs, proving its wide applicability. The code is available at https://github.com/aim-uofa/LoRAPrune.

Submitted 6 August, 2024; v1 submitted 28 May, 2023; originally announced May 2023.

Comments: Accepted by ACL 2024 Findings

arXiv:2305.17306 [pdf, other]

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

Authors: Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot

Abstract: As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite on the multi-step reasoning capabilities of large language models. We are interested in this setting for two reasons: (1) from the behavior of the GPT and PaLM model families, we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs; (2) we envisage large language models to become the next-generation computational platform and foster an ecosystem of LLM-based new applications; this naturally requires the foundation models to perform complex tasks that often involve the composition of linguistic and logical operations. Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs. Our current results show that: (1) model scale clearly correlates with reasoning capabilities; (2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models that are comparable with GPT-4, while open-sourced models still lag behind; (3) LLaMA-65B performs closely to code-davinci-002, indicating that with successful further development such as reinforcement learning from human feedback (RLHF), it has great potential to be close to GPT-3.5-Turbo. Our results also suggest that for the open-source efforts to catch up, the community may focus more on building better base models and exploring RLHF.

Submitted 26 May, 2023; originally announced May 2023.

Comments: Preprint. Code at https://github.com/FranxYao/chain-of-thought-hub

arXiv:2305.16816 [pdf, other]

Songs Across Borders: Singable and Controllable Neural Lyric Translation

Authors: Longshen Ou, Xichu Ma, Min-Yen Kan, Ye Wang

Abstract: The development of general-domain neural machine translation (NMT) methods has advanced significantly in recent years, but the lack of naturalness and musical constraints in the outputs makes them unable to produce singable lyric translations. This paper bridges the singability quality gap by formalizing lyric translation into a constrained translation problem, converting theoretical guidance and practical techniques from translatology literature to prompt-driven NMT approaches, exploring better adaptation methods, and instantiating them to an English-Chinese lyric translation system. Our model achieves 99.85%, 99.00%, and 95.52% on length accuracy, rhyme accuracy, and word boundary recall. In our subjective evaluation, our model shows a 75% relative enhancement on overall quality, compared against naive fine-tuning (Code available at https://github.com/Sonata165/ControllableLyricTranslation).

Submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted by ACL 2023. Camera-ready version

MSC Class: 68T50

arXiv:2304.12082 [pdf, other]

Deep Audio-Visual Singing Voice Transcription based on Self-Supervised Learning Models

Authors: Xiangming Gu, Wei Zeng, Jianan Zhang, Longshen Ou, Ye Wang

Abstract: Singing voice transcription converts recorded singing audio to musical notation. Sound contamination (such as accompaniment) and lack of annotated data make singing voice transcription an extremely difficult task. We take two approaches to tackle the above challenges: 1) introducing multimodal learning for singing voice transcription together with a new multimodal singing dataset, N20EMv2, enhancing noise robustness by utilizing video information (lip movements to predict the onset/offset of notes), and 2) adapting self-supervised learning models from the speech domain to the singing voice transcription task, significantly reducing annotated data requirements while preserving pretrained features. We build a self-supervised learning based audio-only singing voice transcription system, which not only outperforms current state-of-the-art technologies as a strong baseline, but also generalizes well to out-of-domain singing data. We then develop a self-supervised learning based video-only singing voice transcription system that detects note onsets and offsets with an accuracy of about 80%. Finally, based on the powerful acoustic and visual representations extracted by the above two systems as well as the feature fusion design, we create an audio-visual singing voice transcription system that improves the noise robustness significantly under different acoustic environments compared to the audio-only systems.

Submitted 24 April, 2023; originally announced April 2023.

arXiv:2304.09694 [pdf, other]

CrossFusion: Interleaving Cross-modal Complementation for Noise-resistant 3D Object Detection

Authors: Yang Yang, Weijie Ma, Hao Chen, Linlin Ou, Xinyi Yu

Abstract: The combination of LiDAR and camera modalities is proven to be necessary and typical for 3D object detection according to recent studies. Existing fusion strategies tend to overly rely on the LiDAR modality in essence, which exploits the abundant semantics from the camera sensor insufficiently. However, existing methods cannot rely on information from other modalities because the corruption of LiDAR features results in a large domain gap. Following this, we propose CrossFusion, a more robust and noise-resistant scheme that makes full use of the camera and LiDAR features with the designed cross-modal complementation strategy. Extensive experiments we conducted show that our method not only outperforms the state-of-the-art methods under the setting without introducing an extra depth estimation network but also demonstrates our model's noise resistance without re-training for the specific malfunction scenarios by increasing mAP by 5.2% and NDS by 2.4%.

Submitted 19 April, 2023; originally announced April 2023.

arXiv:2301.12726 [pdf, other]

Specializing Smaller Language Models towards Multi-Step Reasoning

Authors: Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot

Abstract: The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled down from GPT-3.5 (≥ 175B) to T5 variants (≤ 11B). We propose model specialization, to specialize the model's ability towards a target task. The hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power, but are spread on a large spectrum of tasks. Small models (commonly viewed as smaller than 10B) have limited model capacity, but if we concentrate their capacity on a specific target task, the model can achieve decently improved performance. We use multi-step math reasoning as our testbed because it is a very typical emergent ability. We show two important aspects of model abilities: (1) there exists a very complex balance/tradeoff between language models' multi-dimensional abilities; (2) by paying the price of decreased generic ability, we can clearly lift up the scaling curve of models smaller than 10B towards a specialized multi-step math reasoning ability. We further give comprehensive discussions about important design choices for better generalization, including the tuning data format, the start model checkpoint, and a new model selection method. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.

Submitted 30 January, 2023; originally announced January 2023.

Comments: Preprint

arXiv:2207.09747 [pdf, other]

Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription

Authors: Longshen Ou, Xiangming Gu, Ye Wang

Abstract: Automatic speech recognition (ASR) has progressed significantly in recent years due to the emergence of large-scale datasets and the self-supervised learning (SSL) paradigm. However, as its counterpart problem in the singing domain, the development of automatic lyric transcription (ALT) suffers from limited data and degraded intelligibility of sung lyrics. To fill in the performance gap between ALT and ASR, we attempt to exploit the similarities between speech and singing. In this work, we propose a transfer-learning-based ALT solution that takes advantage of these similarities by adapting wav2vec 2.0, an SSL ASR model, to the singing domain. We maximize the effectiveness of transfer learning by exploring the influence of different transfer starting points. We further enhance the performance by extending the original CTC model to a hybrid CTC/attention model. Our method surpasses previous approaches by a large margin on various ALT benchmark datasets. Further experiments show that, with even a tiny proportion of training data, our method still achieves competitive performance.

Submitted 16 October, 2022; v1 submitted 20 July, 2022; originally announced July 2022.

Comments: Camera ready version of ISMIR 2022 submission

arXiv:2207.06127 [pdf, other]

doi 10.1145/3503161.3548411

MM-ALT: A Multimodal Automatic Lyric Transcription System

Authors: Xiangming Gu, Longshen Ou, Danielle Ong, Ye Wang

Abstract: Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT with audio data alone is a notoriously difficult task due to instrumental accompaniment and musical constraints resulting in degradation of both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data of an earbud worn by the performing singer. We first adapt the wav2vec 2.0 framework from automatic speech recognition (ASR) to the ALT task. We then propose a video-based ALT method and an IMU-based voice activity detection (VAD) method. In addition, we put forward the Residual Cross Attention (RCA) mechanism to fuse data from the three modalities (i.e., audio, video, and IMU). Experiments show the effectiveness of our proposed MM-ALT system, especially in terms of noise robustness. Project page is at https://n20em.github.io.

Submitted 17 February, 2023; v1 submitted 13 July, 2022; originally announced July 2022.

Comments: Accepted by ACM Multimedia 2022. Camera ready version and appendix

arXiv:2206.15109 [pdf]

MKIoU Loss: Towards Accurate Oriented Object Detection in Aerial Images

Authors: Xinyi Yu, Jiangping Lu, Xinyi Yu, Mi Lin, Linlin Ou

Abstract: Oriented bounding box regression is crucial for oriented object detection. However, regression-based methods often suffer from boundary problems and the inconsistency between loss and evaluation metrics. In this paper, a modulated Kalman IoU loss of approximate SkewIoU is proposed, named MKIoU. To avoid boundary problems, we convert the oriented bounding box to a Gaussian distribution, then use the Kalman filter to approximate the intersection area. However, there exists a significant difference between the calculated and actual intersection areas. Thus, we propose a modulation factor to adjust the sensitivity of angle deviation and width-height offset to loss variation, making the loss more consistent with the evaluation metric. Furthermore, the Gaussian modeling method avoids the boundary problem but simultaneously causes angle confusion for square objects. Thus, the Gaussian Angle Loss (GA Loss) is presented to solve this problem by adding a corrected loss for square targets. The proposed GA Loss can be easily extended to other Gaussian-based methods. Experiments on three publicly available aerial image datasets, DOTA, UCAS-AOD, and HRSC2016, show the effectiveness of the proposed method.

Submitted 30 June, 2022; originally announced June 2022.

arXiv:2205.09830 [pdf, ps, other]

Towards Understanding Gender-Seniority Compound Bias in Natural Language Generation

Authors: Samhita Honnavalli, Aesha Parekh, Lily Ou, Sophie Groenwold, Sharon Levy, Vicente Ordonez, William Yang Wang

Abstract: Women are often perceived as junior to their male counterparts, even within the same job titles. While there has been significant progress in the evaluation of gender bias in natural language processing (NLP), existing studies seldom investigate how biases toward gender groups change when compounded with other societal biases. In this work, we investigate how seniority impacts the degree of gender bias exhibited in pretrained neural generation models by introducing a novel framework for probing compound bias. We contribute a benchmark robustness-testing dataset spanning two domains, U.S. senatorship and professorship, created using a distant-supervision method. Our dataset includes human-written text with underlying ground truth and paired counterfactuals. We then examine GPT-2 perplexity and the frequency of gendered language in generated text. Our results show that GPT-2 amplifies bias by considering women as junior and men as senior more often than the ground truth in both domains. These results suggest that NLP applications built using GPT-2 may harm women in professional capacities.

Submitted 19 May, 2022; originally announced May 2022.

Comments: 6 pages, LREC 2022

arXiv:2205.02003 [pdf, other]

Multi-subgoal Robot Navigation in Crowds with History Information and Interactions

Authors: Xinyi Yu, Jianan Hu, Yuehai Fan, Wancai Zheng, Linlin Ou

Abstract: Robot navigation in dynamic environments shared with humans is an important but challenging task, which suffers from performance deterioration as the crowd grows. In this paper, a multi-subgoal robot navigation approach based on deep reinforcement learning is proposed, which can reason about more comprehensive relationships among all agents (robot and humans). Specifically, the next position point is planned for the robot by introducing history information and interactions. First, based on a subgraph network, the history information of all agents is aggregated before encoding interactions through a graph neural network, so as to improve the ability of the robot to anticipate future scenarios implicitly. Furthermore, in order to reduce the probability of unreliable next position points, a selection module is designed after the policy network in the reinforcement learning framework. In addition, the next position point generated from the selection module satisfies the task requirements better than that obtained directly from the policy network. The experiments demonstrate that our approach outperforms state-of-the-art approaches in terms of both success rate and collision rate, especially in crowded human environments.

Submitted 29 November, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

arXiv:2204.06403 [pdf, other]

Efficient Re-parameterization Operations Search for Easy-to-Deploy Network Based on Directional Evolutionary Strategy

Authors: Xinyi Yu, Xiaowei Wang, Jintao Rong, Mingyang Zhang, Linlin Ou

Abstract: Structural re-parameterization (Rep) methods have achieved significant performance improvements on traditional convolutional networks. Most current Rep methods rely on prior knowledge to select the re-parameterization operations. However, the performance of the architecture is limited by the type of operations and prior knowledge. To break this restriction, in this work, an improved re-parameterization search space is designed, which includes more types of re-parameterization operations. Concretely, the performance of convolutional networks can be further improved by the search space. To effectively explore this search space, an automatic re-parameterization enhancement strategy is designed based on neural architecture search (NAS), which can search for an excellent re-parameterization architecture. Besides, we visualize the output features of the architecture to analyze the reasons for the formation of the re-parameterization architecture. On public datasets, we achieve better results. Under the same training conditions as ResNet, we improve the accuracy of ResNet-50 by 1.82% on ImageNet-1k.

Submitted 3 July, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: 21 pages, 8 figures

arXiv:2204.03898 [pdf, other]

Exploring Transformer's potential on automatic piano transcription

Authors: Longshen Ou, Ziyi Guo, Emmanouil Benetos, Jiqing Han, Ye Wang

Abstract: Most recent research about automatic music transcription (AMT) uses convolutional neural networks and recurrent neural networks to model the mapping from music signals to symbolic notation. Based on a high-resolution piano transcription system, we explore the possibility of incorporating another powerful sequence transformation tool -- the Transformer -- to deal with the AMT problem. We argue that the properties of the Transformer make it more suitable for certain AMT subtasks. We confirm the Transformer's superiority on the velocity detection task by experiments on the MAESTRO dataset and a cross-dataset evaluation on the MAPS dataset. We observe a performance improvement on both frame-level and note-level metrics after introducing the Transformer network.

Submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted by ICASSP 2022

ACM Class: H.5.5

arXiv:2201.02726 [pdf]

doi 10.1088/1361-6501/ac750c

Real-time Rail Recognition Based on 3D Point Clouds

Authors: Xinyi Yu, Weiqi He, Xuecheng Qian, Yang Yang, Linlin Ou

Abstract: Accurate rail location is a crucial part of the railway support driving system for safety monitoring. LiDAR can obtain point clouds that carry 3D information about the railway environment, especially in darkness and terrible weather conditions. In this paper, a real-time rail recognition method based on 3D point clouds is proposed to address challenges such as the disorder, uneven density, and large volume of the point clouds. A voxel down-sampling method is first presented to balance the density of railway point clouds, and a pyramid partition is designed to divide the 3D scanning area into voxels with different volumes. Then, a feature encoding module is developed to find the nearest neighbor points and to aggregate their local geometric features for the center point. Finally, a multi-scale neural network is proposed to generate the prediction results of each voxel and the rail location. The experiments are conducted on 9 sequences of 3D point cloud data for the railway. The results show that the method has good performance in detecting straight rails, curved rails, and rails with other complex topologies.

Submitted 7 January, 2022; originally announced January 2022.

arXiv:2112.15358 [pdf, other]

Conditional Generative Data-free Knowledge Distillation

Authors: Xinyi Yu, Ling Yan, Yang Yang, Libo Zhou, Linlin Ou

Abstract: Knowledge distillation has made remarkable achievements in model compression. However, most existing methods require the original training data, which is usually unavailable due to privacy and security issues. In this paper, we propose a conditional generative data-free knowledge distillation (CGDD) framework for training lightweight networks without any training data. This method realizes efficient knowledge distillation based on conditional image generation. Specifically, we treat the preset labels as ground truth to train a conditional generator in a semi-supervised manner. The trained generator can produce specified classes of training images. For training the student network, we force it to extract the knowledge hidden in teacher feature maps, which provide crucial cues for the learning process. Moreover, an adversarial training framework for promoting distillation performance is constructed by designing several loss functions. This framework helps the student model to explore a larger data space. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on different datasets. Compared with other data-free works, our work obtains state-of-the-art results on CIFAR100, Caltech101, and different versions of ImageNet datasets. The codes will be released.

Submitted 12 August, 2022; v1 submitted 31 December, 2021; originally announced December 2021.

arXiv:2111.02283 [pdf, other]

A Self-adaptive LSAC-PID Approach based on Lyapunov Reward Shaping for Mobile Robots

Authors: Xinyi Yu, Siyu Xu, Yuehai Fan, Linlin Ou

Abstract: To solve the coupling problem of control loops and the adaptive parameter tuning problem in the multi-input multi-output (MIMO) PID control system, a self-adaptive LSAC-PID algorithm is proposed based on deep reinforcement learning (RL) and Lyapunov-based reward shaping in this paper. For complex and unknown mobile robot control environments, an RL-based MIMO PID hybrid control strategy is first presented. According to the dynamic information and environmental feedback of the mobile robot, the RL agent can output the optimal MIMO PID parameters in real time, without knowing the mathematical model or decoupling multiple control loops. Then, to improve the convergence speed of RL and the stability of mobile robots, a Lyapunov-based reward shaping soft actor-critic (LSAC) algorithm is proposed based on Lyapunov theory and the potential-based reward shaping method. The convergence and optimality of the algorithm are proved in terms of the policy evaluation and improvement steps of soft policy iteration. In addition, for line-following robots, the region growing method is improved to adapt to the influence of forks and environmental interference. Through comparison, testing, and cross-validation, the simulation and real-environment experimental results all show good performance of the proposed LSAC-PID tuning algorithm.

Submitted 3 November, 2021; originally announced November 2021.

Comments: 11 pages, 13 figures

arXiv:2110.05842 [pdf, other]

Across-Task Neural Architecture Search via Meta Learning

Authors: Jingtao Rong, Xinyi Yu, Mingyang Zhang, Linlin Ou

Abstract: Adequate labeled data and expensive compute resources are the prerequisites for the success of neural architecture search (NAS). It is challenging to apply NAS in meta-learning scenarios with limited compute resources and data. In this paper, an across-task neural architecture search (AT-NAS) is proposed to address the problem by combining gradient-based meta-learning with EA-based NAS to learn over the distribution of tasks. The supernet is learned over an entire set of tasks by meta-learning its weights. Architecture encodings of subnets sampled from the supernet are iteratively adapted by evolutionary algorithms while simultaneously searching for a task-sensitive meta-network. The searched meta-network can be adapted to a novel task via a few learning steps and costs only a little search time. Empirical results show that AT-NAS surpasses the related approaches on few-shot classification accuracy. The performance of AT-NAS on classification benchmarks is comparable to that of models searched from scratch, by adapting the architecture in less than an hour from a 5-GPU-day pretrained meta-network.

Submitted 12 October, 2021; originally announced October 2021.

arXiv:2109.10187 [pdf]

Oriented Object Detection in Aerial Images Based on Area Ratio of Parallelogram

Authors: Xinyi Yu, Mi Lin, Jiangping Lu, Linlin Ou

Abstract: Oriented object detection is a challenging task in aerial images since the objects in aerial images are displayed in arbitrary directions and are frequently densely packed. The mainstream detectors describe rotating objects using a five-parameter or eight-parameter representation, which suffers from representation ambiguity for oriented object definition. In this paper, we propose a novel representation method based on area ratio of parallelogram, called ARP. Specifically, ARP regresses the minimum bounding rectangle of the oriented object and three area ratios. The three area ratios include the area ratio of a directed object to the smallest circumscribed rectangle and of two parallelograms to the minimum circumscribed rectangle. It simplifies offset learning and eliminates the issue of angular periodicity or label point sequences for oriented objects. To further remedy the confusion issue of nearly horizontal objects, the area ratio between the object and its minimal circumscribed rectangle is employed to guide the selection of horizontal or oriented detection for each object. Moreover, the rotated efficient Intersection over Union (R-EIoU) loss with horizontal bounding box and three area ratios is designed to optimize the bounding box regression for rotating objects. Experimental results on remote sensing datasets, including HRSC2016, DOTA, and UCAS-AOD, show that our method achieves detection performance superior to many state-of-the-art approaches.

Submitted 8 November, 2021; v1 submitted 21 September, 2021; originally announced September 2021.

arXiv:2109.03508 [pdf, other]

RepNAS: Searching for Efficient Re-parameterizing Blocks

Authors: Mingyang Zhang, Xinyi Yu, Jingtao Rong, Linlin Ou

Abstract: In the past years, significant improvements in the field of neural architecture search (NAS) have been made. However, it is still challenging to search for efficient networks due to the gap between the searched constraint and real inference time. To search for a high-performance network with low inference time, several previous works set a computational complexity constraint for the search algorithm. However, many factors affect the speed of inference (e.g., FLOPs, MACs). The correlation between a single indicator and the latency is not strong. Currently, some re-parameterization (Rep) techniques are proposed to convert multi-branch to single-path architecture which is inference-friendly. Nevertheless, multi-branch architectures are still human-defined and inefficient. In this work, we propose a new search space that is suitable for structural re-parameterization techniques. RepNAS, a one-stage NAS approach, is presented to efficiently search the optimal diverse branch block (ODBB) for each layer under the branch number constraint. Our experimental results show the searched ODBB can easily surpass the manual diverse branch block (DBB) with efficient training.

Submitted 14 June, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

arXiv:2108.04123 [pdf, other]

DP-DNA: A Digital Pattern-Aware DNA Storage System to Improve Encoding Density

Authors: Bingzhe Li, Li Ou, David Du

Abstract: With the rapid increase of available digital data, DNA storage is identified as a storage medium with high density and capability of long-term preservation, especially for archival storage systems. However, the encoding density (i.e., how many binary bits can be encoded into one nucleotide) and error handling are two major factors intertwined in DNA storage. Considering encoding density, theoretically, one nucleotide can encode two binary bits (upper bound). However, due to biochemical constraints and other necessary information associated with payload, the encoding densities of various DNA storage systems are much less than this upper bound. Additionally, all existing studies of DNA encoding schemes are based on static analysis and lack awareness of dynamically changing digital patterns. Therefore, the gap between the static encoding and dynamic binary patterns prevents achieving a higher encoding density for DNA storage systems. In this paper, we propose a new Digital Pattern-Aware DNA storage system, called DP-DNA, which can efficiently store digital data in DNA storage with high encoding density. DP-DNA maintains a set of encoding codes and uses a digital pattern-aware code (DPAC) to analyze the patterns of a binary sequence for a DNA strand and selects an appropriate code for encoding the binary sequence to achieve a high encoding density. An additional encoding field is added to the DNA encoding format, which can distinguish the encoding scheme used for those DNA strands, and thus we can decode DNA data back to its original digital data. Moreover, to further improve the encoding density, a variable-length scheme is proposed to increase the feasibility of the coding scheme with a high encoding density. Finally, the experimental results indicate that the proposed DP-DNA achieves up to 103.5% higher encoding densities than prior work.

Submitted 24 August, 2021; v1 submitted 9 August, 2021; originally announced August 2021.

Comments: 14 pages, 13 figures

arXiv:2106.06485 [pdf]

doi 10.1007/s11633-022-1321-8

Pedestrian Attribute Recognition in Video Surveillance Scenarios Based on View-attribute Attention Localization

Authors: Weichen Chen, Xinyi Yu, Linlin Ou

Abstract: Pedestrian attribute recognition in surveillance scenarios is still a challenging task due to the inaccurate localization of specific attributes. In this paper, we propose a novel view-attribute localization method based on attention (VALA), which utilizes view information to guide the recognition process to focus on specific attributes and attention mechanism to localize specific attribute-corresponding areas. Concretely, view information is leveraged by the view prediction branch to generate four view weights that represent the confidences for attributes from different views. View weights are then delivered back to compose specific view-attributes, which will participate and supervise deep feature extraction. In order to explore the spatial location of a view-attribute, regional attention is introduced to aggregate spatial information and encode inter-channel dependencies of the view feature. Subsequently, a fine attentive attribute-specific region is localized, and regional weights for the view-attribute from different spatial locations are gained by the regional attention. The final view-attribute recognition outcome is obtained by combining the view weights with the regional weights. Experiments on three wide datasets (RAP, RAPv2, and PA-100K) demonstrate the effectiveness of our approach compared with state-of-the-art methods.

Submitted 19 December, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

Journal ref: Springer2022-Machine Intelligence Research

arXiv:2103.10686 [pdf, other]

doi 10.1002/rnc.5662

A Self-adaptive SAC-PID Control Approach based on Reinforcement Learning for Mobile Robots

Authors: Xinyi Yu, Yuehai Fan, Siyu Xu, Linlin Ou

Abstract: Proportional-integral-derivative (PID) control is the most widely used method in industrial control, robot control and other fields. However, traditional PID control is not competent when the system cannot be accurately modeled and the operating environment is variable in real time. To tackle these problems, we propose a self-adaptive model-free SAC-PID control approach based on reinforcement learning for automatic control of mobile robots. A new hierarchical structure is developed, which includes the upper controller based on soft actor-critic (SAC), one of the most competitive continuous control algorithms, and the lower controller based on an incremental PID controller. Soft actor-critic receives the dynamic information of the mobile robot as input, and simultaneously outputs the optimal parameters of incremental PID controllers to compensate for the error between the path and the mobile robot in real time. In addition, the combination of the 24-neighborhood method and polynomial fitting is developed to improve the adaptability of the SAC-PID control method to complex environments. The effectiveness of the SAC-PID control method is verified with several paths of different difficulty both on Gazebo and on a real mecanum mobile robot. Furthermore, compared with fuzzy PID control, the SAC-PID method has merits of strong robustness, generalization and real-time performance.

Submitted 19 March, 2021; originally announced March 2021.

Comments: 20 pages, 12 figures

Journal ref: Int J Robust Nonlinear Control 31 (2021) 1-19

arXiv:2103.02847 [pdf, other]

doi 10.1145/3456727.3463771

IMG-DNA: Approximate DNA Storage for Images

Authors: Bingzhe Li, Li Ou, David Du

Abstract: Deoxyribonucleic Acid (DNA) as a storage medium with high density and long-term preservation properties can satisfy the requirement of archival storage for rapidly increasing digital volume. The read and write processes of DNA storage are error-prone. Images widely used in social media have fault-tolerance properties that are well suited to DNA storage. However, prior work simply investigated the feasibility of DNA storage storing different types of data and simply stored images in DNA storage, which did not fully investigate the fault-tolerant potential of images in DNA storage. In this paper, we propose a new image-based DNA system called IMG-DNA, which can efficiently store images in DNA storage with improved DNA storage robustness. First, a new DNA architecture is proposed to fit JPEG-based images and improve the image's robustness in DNA storage. Moreover, barriers inserted in DNA sequences efficiently prevent error propagation in images of DNA storage. The experimental results indicate that the proposed IMG-DNA achieves much higher fault tolerance than prior work.

Submitted 29 May, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

Comments: 11 pages, 12 figures

arXiv:2011.04908 [pdf, other]

Effective Model Compression via Stage-wise Pruning

Authors: Mingyang Zhang, Xinyi Yu, Jingtao Rong, Linlin Ou

Abstract: Automated Machine Learning (Auto-ML) pruning methods aim at searching a pruning strategy automatically to reduce the computational complexity of deep Convolutional Neural Networks (deep CNNs). However, some previous work found that the results of many Auto-ML pruning methods cannot even surpass the results of the uniform pruning method. In this paper, the ineffectiveness of Auto-ML pruning which is caused by unfull and unfair training of the supernet is shown. A deep supernet suffers from unfull training because it contains too many candidates. To overcome the unfull training, a stage-wise pruning (SWP) method is proposed, which splits a deep supernet into several stage-wise supernets to reduce the candidate number and utilizes inplace distillation to supervise the stage training. Besides, a wide supernet is hit by unfair training since the sampling probability of each channel is unequal. Therefore, the fullnet and the tinynet are sampled in each training iteration to ensure each channel can be overtrained. Remarkably, the proxy performance of the subnets trained with SWP is closer to the actual performance than that of most of the previous Auto-ML pruning work. Experiments show that SWP achieves the state-of-the-art on both CIFAR-10 and ImageNet under the mobile setting.

Submitted 22 September, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

arXiv:2010.08412 [pdf, other]

Vector-Vector-Matrix Architecture: A Novel Hardware-Aware Framework for Low-Latency Inference in NLP Applications

Authors: Matthew Khoury, Rumen Dangovski, Longwu Ou, Preslav Nakov, Yichen Shen, Li Jing

Abstract: Deep neural networks have become the standard approach to building reliable Natural Language Processing (NLP) applications, ranging from Neural Machine Translation (NMT) to dialogue systems. However, improving accuracy by increasing the model size requires a large number of hardware computations, which can slow down NLP applications significantly at inference time. To address this issue, we propose a novel vector-vector-matrix architecture (VVMA), which greatly reduces the latency at inference time for NMT. This architecture takes advantage of specialized hardware that has low-latency vector-vector operations and higher-latency vector-matrix operations. It also reduces the number of parameters and FLOPs for virtually all models that rely on efficient matrix multipliers without significantly impacting accuracy. We present empirical results suggesting that our framework can reduce the latency of sequence-to-sequence and Transformer models used for NMT by a factor of four. Finally, we show evidence suggesting that our VVMA extends to other domains, and we discuss novel hardware for its efficient use.

Submitted 6 October, 2020; originally announced October 2020.

Comments: To appear at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP '20), November 16-20, 2020, NMT, AI accelerators, co-design, TPU, OPU, 10 pages, 3 figures, 4 tables

arXiv:2010.02510 [pdf, other]

Investigating African-American Vernacular English in Transformer-Based Text Generation

Authors: Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, William Yang Wang

Abstract: The growth of social media has encouraged the written use of African American Vernacular English (AAVE), which has traditionally been used only in oral contexts. However, NLP models have historically been developed using dominant English varieties, such as Standard American English (SAE), due to text corpora availability. We investigate the performance of GPT-2 on AAVE text by creating a dataset of intent-equivalent parallel AAVE/SAE tweet pairs, thereby isolating syntactic structure and AAVE- or SAE-specific language for each pair. We evaluate each sample and its GPT-2 generated text with pretrained sentiment classifiers and find that while AAVE text results in more classifications of negative sentiment than SAE, the use of GPT-2 generally increases occurrences of positive sentiment for both. Additionally, we conduct human evaluation of AAVE and SAE text generated with GPT-2 to compare contextual rigor and overall quality.

Submitted 29 October, 2020; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: 7 pages, EMNLP 2020

arXiv:2007.13881 [pdf, other]

iESC: iterative Equivalent Surface Current Approximation

Authors: Shaolin Liao, Lu Ou

Abstract: A novel iterative Equivalent Surface Current (iESC) algorithm has been developed to simulate the electromagnetic scattering of electrically large dielectric objects with relatively smooth surfaces. The iESC algorithm corrects the surface currents to compensate for the electromagnetic field deviation across the dielectric surface. Numerical validation has been performed with a dielectric sphere to show the performance of the iESC algorithm. The experimental result shows that it takes only a few iterations for the algorithm to increase the surface current accuracy by more than three orders of magnitude.

Submitted 14 July, 2020; originally announced July 2020.

Comments: 10 pages, 7 figures

arXiv:2004.13939 [pdf, ps, other]

Evaluating Transformer-Based Multilingual Text Classification

Authors: Sophie Groenwold, Samhita Honnavalli, Lily Ou, Aesha Parekh, Sharon Levy, Diba Mirza, William Yang Wang

Abstract: As NLP tools become ubiquitous in today's technological landscape, they are increasingly applied to languages with a variety of typological structures. However, NLP research does not focus primarily on typological differences in its analysis of state-of-the-art language models. As a result, NLP tools perform unequally across languages with different syntactic and morphological structures. Through a detailed discussion of word order typology, morphological typology, and comparative linguistics, we identify which variables most affect language modeling efficacy; in addition, we calculate word order and morphological similarity indices to aid our empirical study. We then use this background to support our analysis of an experiment we conduct using multi-class text classification on eight languages and eight models.

Submitted 30 April, 2020; v1 submitted 28 April, 2020; originally announced April 2020.

Comments: Total of 15 pages (9 pages for paper, 2 pages for references, 4 pages for appendix). Changed title

arXiv:2003.01751 [pdf, other]

Automatic Hyper-Parameter Optimization Based on Mapping Discovery from Data to Hyper-Parameters

Authors: Bozhou Chen, Kaixin Zhang, Longshen Ou, Chenmin Ba, Hongzhi Wang, Chunnan Wang

Abstract: Machine learning algorithms have made remarkable achievements in the field of artificial intelligence. However, most machine learning algorithms are sensitive to the hyper-parameters. Manually optimizing the hyper-parameters is a common method of hyper-parameter tuning. However, it is costly and empirically dependent. Automatic hyper-parameter optimization (autoHPO) is favored due to its effectiveness. However, current autoHPO methods are usually only effective for a certain type of problem, and the time cost is high. In this paper, we propose an efficient automatic parameter optimization approach, which is based on the mapping from data to the corresponding hyper-parameters. To describe such mapping, we propose a sophisticated network structure. To obtain such mapping, we develop effective network construction algorithms. We also design a strategy to optimize the result further during the application of the mapping. Extensive experimental results demonstrate that the proposed approaches outperform the state-of-the-art approaches significantly.

Submitted 3 March, 2020; originally announced March 2020.

arXiv:1911.09817 [pdf, other]

Graph Pruning for Model Compression

Authors: Mingyang Zhang, Xinyi Yu, Jingtao Rong, Linlin Ou

Abstract: Previous AutoML pruning works utilized individual layer features to automatically prune filters. We analyze the correlation for two layers from different blocks which have a short-cut structure. It shows that, in one block, the deeper layer has many redundant filters which can be represented by filters in the former layer. So, it is necessary to take information from other layers into consideration in pruning. In this paper, a novel pruning method, named GraphPruning, is proposed. Any series of the network is viewed as a graph. To automatically aggregate neighboring features for each node, a graph aggregator based on graph convolution networks (GCN) is designed. In the training stage, a PruningNet that is given aggregated node features generates reasonable weights for any size of the sub-network. Subsequently, the best configuration of the Pruned Network is searched by reinforcement learning. Different from previous work, we take the node features from a well-trained graph aggregator, instead of hand-crafted features, as the states in reinforcement learning. Compared with other AutoML pruning works, our method has achieved the state-of-the-art under the same conditions on ImageNet-2012.

Submitted 22 September, 2021; v1 submitted 21 November, 2019; originally announced November 2019.

Comments: accepted by Applied Intelligence

Showing 1–38 of 38 results for author: Ou, L