Search | arXiv e-print repository

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

Abstract: Recently, speech generation models have made significant progress by using large-scale training data. However, the research community struggles to produce highly spontaneous and human-like speech due to the lack of large-scale, diverse, and spontaneous speech data. This paper presents Emilia, the first multilingual speech generation dataset from in-the-wild speech data, and Emilia-Pipe, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation. Emilia starts with over 101k hours of speech in six languages and features diverse speech with varied speaking styles. To facilitate the scale-up of Emilia, the open-source pipeline Emilia-Pipe can make one hour of raw speech data ready for model training in a few minutes, which enables the research community to collaborate on large-scale speech generation research. Experimental results validate the effectiveness of Emilia. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.

Submitted 12 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: Fix typos

arXiv:2406.18045

PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

Authors: Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang , et al. (11 additional authors not shown)

Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmaGPT, a suite of domain-specialized LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus tailored to the Bio-Pharmaceutical and Chemical domains. Our evaluation shows that PharmaGPT surpasses existing general models on specific-domain benchmarks such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. Remarkably, this performance is achieved with a model that has only a fraction, sometimes just one-tenth, of the parameters of general-purpose large models. This advancement establishes a new benchmark for LLMs in the bio-pharmaceutical and chemical fields, addressing the existing gap in specialized language modeling. It also suggests a promising path for enhanced research and development, paving the way for more precise and effective NLP applications in these areas.

Submitted 9 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2405.16785

PromptFix: You Prompt and We Fix the Photo

Authors: Yongsheng Yu, Ziyun Zeng, Hang Hua, Jianlong Fu, Jiebo Luo

Abstract: Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks, allowing image processing to adhere to human instructions. However, the lack of diverse instruction-following data hampers the development of models that effectively recognize and execute user-customized instructions, particularly in low-level tasks. Moreover, the stochastic nature of the diffusion process leads to deficiencies in image generation or editing tasks that require the detailed preservation of the generated images. To address these limitations, we propose PromptFix, a comprehensive framework that enables diffusion models to follow human instructions to perform a wide variety of image-processing tasks. First, we construct a large-scale instruction-following dataset that covers comprehensive image-processing tasks, including low-level tasks, image editing, and object creation. Next, we propose a high-frequency guidance sampling method to explicitly control the denoising process and preserve high-frequency details in unprocessed areas. Finally, we design an auxiliary prompting adapter, utilizing Vision-Language Models (VLMs) to enhance text prompts and improve the model's task generalization. Experimental results show that PromptFix outperforms previous methods in various image-processing tasks. Our proposed model also achieves comparable inference efficiency with these baseline models and exhibits superior zero-shot capabilities in blind restoration and combination tasks. The dataset and code will be available at https://github.com/yeates/PromptFix.

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2404.18255

PatentGPT: A Large Language Model for Intellectual Property

Authors: Zilong Bai, Ruiji Zhang, Linqing Chen, Qijun Cai, Yuan Zhong, Cong Wang, Yan Fang, Jie Fang, Jing Sun, Weikuan Wang, Lizhi Zhou, Haoran Hua, Tian Qiu, Chaochao Wang, Cheng Sun, Jianping Lu, Yixin Wang, Yubin Xia, Meng Hu, Haowen Liu, Peng Xu, Licong Xu, Fu Bian, Xiaolong Gu, Lisha Zhang , et al. (2 additional authors not shown)

Abstract: In recent years, large language models (LLMs) have attracted significant attention due to their exceptional performance across a multitude of natural language processing tasks, and have been widely applied in various fields. However, the application of large language models in the Intellectual Property (IP) domain is challenging due to the strong need for specialized knowledge, privacy protection, and processing of extremely long text in this field. In this technical report, we present for the first time a low-cost, standardized procedure for training IP-oriented LLMs, meeting the unique requirements of the IP domain. Using this standard process, we have trained the PatentGPT series models based on open-source pretrained models. When evaluated on the open-source IP-oriented benchmark MOZIP, our domain-specific LLMs outperform GPT-4, indicating the effectiveness of the proposed training procedure and the expertise of the PatentGPT models in the IP domain. Remarkably, our model surpassed GPT-4 on the 2019 China Patent Agent Qualification Examination, scoring 65 and matching human expert levels. Additionally, the PatentGPT model, which utilizes the SMoE architecture, achieves performance comparable to that of GPT-4 in the IP domain and demonstrates a better cost-performance ratio on long-text tasks, potentially serving as an alternative to GPT-4 within the IP domain.

Submitted 4 June, 2024; v1 submitted 28 April, 2024; originally announced April 2024.

Comments: 19 pages, 9 figures

ACM Class: I.2.7

arXiv:2404.15532

BattleAgent: Multi-modal Dynamic Emulation on Historical Battles to Complement Historical Analysis

Authors: Shuhang Lin, Wenyue Hua, Lingyao Li, Che-Jui Chang, Lizhou Fan, Jianchao Ji, Hang Hua, Mingyu Jin, Jiebo Luo, Yongfeng Zhang

Abstract: This paper presents BattleAgent, an emulation system that combines the Large Vision-Language Model and Multi-agent System. This novel system aims to simulate complex dynamic interactions among multiple agents, as well as between agents and their environments, over a period of time. It emulates both the decision-making processes of leaders and the viewpoints of ordinary participants, such as soldiers. The emulation showcases the current capabilities of agents, featuring fine-grained multi-modal interactions between agents and landscapes. It develops customizable agent structures to meet specific situational requirements, for example, a variety of battle-related activities like scouting and trench digging. These components collaborate to recreate historical events in a lively and comprehensive manner while offering insights into the thoughts and feelings of individuals from diverse viewpoints. The technological foundations of BattleAgent establish detailed and immersive settings for historical battles, enabling individual agents to partake in, observe, and dynamically respond to evolving battle scenarios. This methodology holds the potential to substantially deepen our understanding of historical events, particularly through individual accounts. Such initiatives can also aid historical research, as conventional historical narratives often lack documentation and prioritize the perspectives of decision-makers, thereby overlooking the experiences of ordinary individuals. BattleAgent illustrates AI's potential to revitalize the human aspect in crucial social events, thereby fostering a more nuanced collective understanding and driving the progressive development of human society.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 26 pages, 14 figures. The data and code for this project are accessible at https://github.com/agiresearch/battleagent

arXiv:2404.14715

FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Authors: Hang Hua, Jing Shi, Kushal Kafle, Simon Jenni, Daoan Zhang, John Collomosse, Scott Cohen, Jiebo Luo

Abstract: Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability of VLMs to perform complex reasoning, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU, for which our experiments show a high correlation to human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.12353

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Authors: Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo

Abstract: Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited number of source videos, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2402.00827

Emo-Avatar: Efficient Monocular Video Style Avatar through Texture Rendering

Authors: Pinxin Liu, Luchuan Song, Daoan Zhang, Hang Hua, Yunlong Tang, Huaijin Tu, Jiebo Luo, Chenliang Xu

Abstract: Artistic video portrait generation is a significant and sought-after task in the fields of computer graphics and vision. While various methods have been developed that integrate NeRFs or StyleGANs with instructional editing models for creating and editing drivable portraits, these approaches face several challenges. They often rely heavily on large datasets, require extensive customization processes, and frequently result in reduced image quality. To address the above problems, we propose the Efficient Monotonic Video Style Avatar (Emo-Avatar) through deferred neural rendering that enhances StyleGAN's capacity for producing dynamic, drivable portrait videos. We propose a two-stage deferred neural rendering pipeline. In the first stage, we utilize few-shot PTI initialization to initialize the StyleGAN generator through several extreme poses sampled from the video to capture the consistent representation of aligned faces from the target portrait. In the second stage, we propose a Laplacian pyramid for high-frequency texture sampling from UV maps deformed by dynamic flow of expression for motion-aware texture prior integration to provide torso features to enhance StyleGAN's ability to generate complete and upper body for portrait video rendering. Emo-Avatar reduces style customization time from hours to merely 5 minutes compared with existing methods. In addition, Emo-Avatar requires only a single reference image for editing and employs region-aware contrastive learning with semantic invariant CLIP guidance, ensuring consistent high-resolution output and identity preservation. Through both quantitative and qualitative assessments, Emo-Avatar demonstrates superior performance over existing methods in terms of training efficiency, rendering quality and editability in self- and cross-reenactment.

Submitted 14 March, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: The paper needs a big modification, including the title. This work is no longer its original version.

arXiv:2310.17661

An Overview on IEEE 802.11bf: WLAN Sensing

Authors: Rui Du, Haocheng Hua, Hailiang Xie, Xianxin Song, Zhonghao Lyu, Mengshi Hu, Narengerile, Yan Xin, Stephen McCann, Michael Montemurro, Tony Xiao Han, Jie Xu

Abstract: With recent advancements, the wireless local area network (WLAN) or wireless fidelity (Wi-Fi) technology has been successfully utilized to realize sensing functionalities such as detection, localization, and recognition. However, the WLAN standards are developed mainly for the purpose of communication, and thus may not be able to meet the stringent requirements for emerging sensing applications. To resolve this issue, a new Task Group (TG), namely IEEE 802.11bf, has been established by the IEEE 802.11 working group, with the objective of creating a new amendment to the WLAN standard to meet advanced sensing requirements while minimizing the effect on communications. This paper provides a comprehensive overview of the up-to-date efforts in the IEEE 802.11bf TG. First, we introduce the definition of the 802.11bf amendment and its formation and standardization timeline. Next, we discuss the WLAN sensing use cases with the corresponding key performance indicator (KPI) requirements. After reviewing previous WLAN sensing research based on communication-oriented WLAN standards, we identify their limitations and underscore the practical need for the new sensing-oriented amendment in 802.11bf. Furthermore, we discuss the WLAN sensing framework and procedure used for measurement acquisition, by considering both sensing at sub-7 GHz and directional multi-gigabit (DMG) sensing at 60 GHz, and address their shared features, similarities, and differences. In addition, we present various candidate technical features for IEEE 802.11bf, including waveform/sequence design, feedback types, as well as quantization and compression techniques. We also describe the methodologies and the channel modeling used by the IEEE 802.11bf TG for evaluation. Finally, we discuss in detail the challenges and future research directions, to motivate more research endeavors in this field.

Submitted 20 October, 2023; originally announced October 2023.

Comments: 31 pages, 25 figures; this is a significantly updated version of arXiv:2207.04859

arXiv:2309.11827

The Impact of Silence on Speech Anti-Spoofing

Authors: Yuxiang Zhang, Zhuo Li, Jingze Lu, Hua Hua, Wenchao Wang, Pengyuan Zhang

Abstract: The current speech anti-spoofing countermeasures (CMs) show excellent performance on specific datasets. However, removing the silence of test speech through Voice Activity Detection (VAD) can severely degrade performance. In this paper, the impact of silence on speech anti-spoofing is analyzed. First, the reasons for the impact are explored, including the proportion of silence duration and the content of silence. The proportion of silence duration in spoof speech generated by text-to-speech (TTS) algorithms is lower than that in bonafide speech, and the content of silence generated by different waveform generators varies compared to bonafide speech. Then the impact of silence on model prediction is explored. Even after retraining, the spoof speech generated by neural network based end-to-end TTS algorithms suffers a significant rise in error rates when the silence is removed. To demonstrate the reasons for the impact of silence on CMs, the attention distribution of a CM is visualized through class activation mapping (CAM). Furthermore, the implementation and analysis of the experiments masking silence or non-silence demonstrate the significance of the proportion of silence duration for detecting TTS and the importance of silence content for detecting voice conversion (VC). Based on the experimental results, improving the robustness of CMs against unknown spoofing attacks by masking silence is also proposed. Finally, the attacks on anti-spoofing CMs through concatenating silence, and the mitigation of VAD and silence attack through low-pass filtering are introduced.

Submitted 21 September, 2023; originally announced September 2023.

Comments: 16 pages, 9 figures, 13 tables

arXiv:2308.16130

Near-Field 3D Localization via MIMO Radar: Cramér-Rao Bound Analysis and Estimator Design

Authors: Haocheng Hua, Jie Xu, Yonina C. Eldar

Abstract: This paper studies a near-field multiple-input multiple-output (MIMO) radar sensing system, in which the transceivers with massive antennas aim to localize multiple near-field targets in the three-dimensional (3D) space over unknown cluttered environments. We consider a spherical wavefront propagation with both channel phase and amplitude variations over different antennas. Under this setup, the unknown parameters include the 3D coordinates and complex reflection coefficients of the targets, as well as the noise and interference covariance matrix. First, by considering general transmit signal waveforms, we derive the Fisher information matrix (FIM) corresponding to the 3D coordinates and the complex reflection coefficients of the targets and accordingly obtain the Cramér-Rao bound (CRB) for the 3D coordinates. This provides a performance bound for 3D near-field target localization. For the special single-target case, we obtain the CRB in an analytical form, and analyze its asymptotic scaling behaviors with respect to the target distance and antenna size of the transceiver. Next, to facilitate practical localization, we propose two estimators to localize targets based on the maximum likelihood (ML) criterion, namely the 3D approximate cyclic optimization (3D-ACO) and the 3D cyclic optimization with white Gaussian noise (3D-CO-WGN), respectively. Numerical results validate the asymptotic CRB analysis and show that the consideration of varying channel amplitudes is vital to achieve accurate CRB and localization when the targets are close to the transceivers. It is also shown that the proposed estimators achieve localization performance close to the derived CRB under various cluttered environments, thus validating their effectiveness in practical implementation. Furthermore, it is shown that transmit waveforms have a significant impact on CRB and the localization performance.

Submitted 30 August, 2023; originally announced August 2023.

Comments: 13 pages (14 pages in arXiv version), 16 figures, submitted for journal publication. arXiv admin note: substantial text overlap with arXiv:2305.10986

arXiv:2308.13365

Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder

Authors: Xuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang

Abstract: Neural networks have been able to generate high-quality single-sentence speech. However, audio-book speech synthesis remains a challenge due to the intra-paragraph correlation of semantic and acoustic features as well as variable styles. In this paper, we propose a highly expressive paragraph speech synthesis system with a multi-step variational autoencoder, called EP-MSTTS. EP-MSTTS is the first VITS-based paragraph speech synthesis model and models the variable style of paragraph speech at five levels: frame, phoneme, word, sentence, and paragraph. We also propose a series of improvements to enhance the performance of this hierarchical model. In addition, we directly train EP-MSTTS on speech sliced by paragraph rather than sentence. Experimental results on the single-speaker French audiobook corpus released at Blizzard Challenge 2023 show that EP-MSTTS obtains better performance than baseline models.

Submitted 11 June, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

Comments: accepted at Interspeech 2024

arXiv:2305.10986

Near-Field 3D Localization via MIMO Radar: Cramér-Rao Bound and Estimator Design

Authors: Haocheng Hua, Jie Xu

Abstract: Future sixth-generation (6G) networks are envisioned to provide both sensing and communications functionalities by using densely deployed base stations (BSs) with massive antennas operating in millimeter wave (mmWave) and terahertz (THz). Due to the large number of antennas and the high frequency band, the sensing and communications will operate within the near-field region, thus making the conventional designs based on the far-field channel models inapplicable. This paper studies a near-field multiple-input-multiple-output (MIMO) radar sensing system, in which the transceivers with massive antennas aim to localize multiple near-field targets in the three-dimensional (3D) space. In particular, we adopt a general wavefront propagation model by considering the exact spherical wavefront with both channel phase and amplitude variations over different antennas. Besides, we consider the general transmit signal waveforms and also consider the unknown cluttered environments. Under this setup, the unknown parameters to estimate include the 3D coordinates and the complex reflection coefficients of the multiple targets, as well as the noise and interference covariance matrix. Accordingly, we derive the Cramér-Rao bound (CRB) for estimating the target coordinates. Next, to facilitate practical localization, we propose an efficient estimator based on the 3D approximate cyclic optimization (3D-ACO), which is obtained following the maximum likelihood (ML) criterion. Finally, numerical results show that considering the exact antenna-varying channel amplitudes achieves more accurate CRB as compared to prior works based on constant channel amplitudes across antennas, especially when the targets are close to the transceivers. It is also shown that the proposed estimator achieves localization performance close to the derived CRB, thus validating its superior performance.

Submitted 15 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: 8 pages, 4 figures as an extended version. Its 6-page version has been accepted for presentation at IEEE Globecom 2023 Symposia

arXiv:2303.12060

VideoXum: Cross-modal Visual and Textural Summarization of Videos

Authors: Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo

Abstract: Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BILP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modality summary. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.

Submitted 23 April, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: 13 pages, 7 figures

Journal ref: IEEE Transactions on Multimedia, VOL. 26 (2024) 5548-5560

arXiv:2211.10605 [pdf, other]

ISAC Meets SWIPT: Multi-functional Wireless Systems Integrating Sensing, Communication, and Powering

Authors: Yilong Chen, Haocheng Hua, Jie Xu, Derrick Wing Kwan Ng

Abstract: This paper unifies integrated sensing and communication (ISAC) and simultaneous wireless information and power transfer (SWIPT), by investigating a new multi-functional multiple-input multiple-output (MIMO) system integrating wireless sensing, communication, and powering. In this system, one multi-antenna hybrid access point (H-AP) transmits wireless signals to communicate with one multi-antenna information decoding (ID) receiver, wirelessly charge one multi-antenna energy harvesting (EH) receiver, and perform radar target sensing based on the echo signal at the same time. Under this setup, we aim to reveal the fundamental performance tradeoff limits among sensing, communication, and powering, in terms of the estimation Cramer-Rao bound (CRB), achievable communication rate, and harvested energy level, respectively. In particular, we consider two different target models for radar sensing, namely the point and extended targets, for which we are interested in estimating the target angle and the complete target response matrix, respectively. For both models, we define the achievable CRB-rate-energy (C-R-E) region and characterize its Pareto boundary by maximizing the achievable rate at the ID receiver, subject to the estimation CRB requirement for target sensing, the harvested energy requirement at the EH receiver, and the maximum transmit power constraint at the H-AP. We obtain the well-structured optimal transmit covariance solutions to the two formulated problems by applying advanced convex optimization techniques. Numerical results show the optimal C-R-E region boundary achieved by our proposed design, as compared to the benchmark schemes based on time switching and eigenmode transmission (EMT).

Submitted 16 August, 2023; v1 submitted 19 November, 2022; originally announced November 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2210.16716

arXiv:2211.09699 [pdf, other]

PromptCap: Prompt-Guided Task-Aware Image Captioning

Authors: Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, Jiebo Luo

Abstract: Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable LM to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, which visual entities to describe are often underspecified. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Different from generic captions, PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption. The prompt contains a question that the caption should aid in answering. To avoid extra annotation, PromptCap is trained by examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.

Submitted 17 August, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

Comments: Accepted to ICCV 2023

arXiv:2210.16716 [pdf, other]

Transmit Optimization for Multi-functional MIMO Systems Integrating Sensing, Communication, and Powering

Authors: Yilong Chen, Haocheng Hua, Jie Xu

Abstract: This paper unifies integrated sensing and communication (ISAC) and simultaneous wireless information and power transfer (SWIPT), by investigating a new multi-functional multiple-input multiple-output (MIMO) system integrating wireless sensing, communication, and powering. In this system, one multi-antenna hybrid access point (H-AP) transmits wireless signals to communicate with one multi-antenna information decoding (ID) receiver, wirelessly charge one multi-antenna energy harvesting (EH) receiver, and perform radar sensing for a point target based on the echo signal at the same time. Under this setup, we aim to reveal the fundamental performance tradeoff limits of sensing, communication, and powering, in terms of the estimation Cramér-Rao bound (CRB), achievable communication rate, and harvested energy level, respectively. Towards this end, we define the achievable CRB-rate-energy (C-R-E) region and characterize its Pareto boundary by maximizing the achievable rate at the ID receiver, subject to the estimation CRB requirement for target sensing, the harvested energy requirement at the EH receiver, and the maximum transmit power constraint at the H-AP. We obtain the semi-closed-form optimal transmit covariance solution to the formulated problem by applying advanced convex optimization techniques. Numerical results show the optimal C-R-E region boundary achieved by our proposed design, as compared to the benchmark scheme based on time switching.

Submitted 29 October, 2022; originally announced October 2022.

Comments: 7 pages, 4 figures, ICC-WC 2023

arXiv:2210.14229 [pdf, other]

Causal Information Bottleneck Boosts Adversarial Robustness of Deep Neural Network

Authors: Huan Hua, Jun Yan, Xi Fang, Weiquan Huang, Huilin Yin, Wancheng Ge

Abstract: The information bottleneck (IB) method is a feasible defense solution against adversarial attacks in deep learning. However, this method suffers from the spurious correlation, which leads to the limitation of its further improvement of adversarial robustness. In this paper, we incorporate the causal inference into the IB framework to alleviate such a problem. Specifically, we divide the features obtained by the IB method into robust features (content information) and non-robust features (style information) via the instrumental variables to estimate the causal effects. With the utilization of such a framework, the influence of non-robust features could be mitigated to strengthen the adversarial robustness. We make an analysis of the effectiveness of our proposed method. The extensive experiments in MNIST, FashionMNIST, and CIFAR-10 show that our method exhibits the considerable robustness against multiple adversarial attacks. Our code would be released.

Submitted 25 October, 2022; originally announced October 2022.

arXiv:2209.12721 [pdf, other]

MIMO Integrated Sensing and Communication: CRB-Rate Tradeoff

Authors: Haocheng Hua, Tony Xiao Han, Jie Xu

Abstract: This paper studies a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system, in which a multi-antenna base station (BS) sends unified wireless signals to estimate one sensing target and communicate with a multi-antenna communication user (CU) simultaneously. We consider both the point and extended target models. For the point target case, the BS estimates the target angle and we adopt the Cramér-Rao bound (CRB) for angle estimation as the sensing performance metric. For the extended target case, the BS estimates the complete target response matrix, and we consider three different sensing performance metrics including the trace, the maximum eigenvalue, and the determinant of the CRB matrix for target response matrix estimation. For each of the four scenarios with different CRB measures, we investigate the fundamental tradeoff between the CRB for estimation and the data rate for communication, by characterizing the Pareto boundary of the achievable CRB-rate (C-R) region. In particular, we formulate a new MIMO rate maximization problem for each scenario, by optimizing the transmit covariance matrix at the BS, subject to a different form of maximum CRB constraint and its maximum transmit power constraint. For these problems, we obtain their optimal solutions in semi-closed forms by using advanced convex optimization techniques. For the point target case, the optimal solution is obtained by diagonalizing a composite channel matrix via singular value decomposition (SVD) together with water-filling-like power allocation over these decomposed subchannels. For the three scenarios in the extended target case, the optimal solutions are obtained by diagonalizing the communication channel via SVD, together with proper power allocation over two orthogonal sets of subchannels. Numerical results are conducted to validate the proposed design.

Submitted 26 September, 2022; originally announced September 2022.

Comments: 30 pages, 17 figures, submitted for journal publication

arXiv:2208.14447 [pdf, ps, other]

A further exploration of deep Multi-Agent Reinforcement Learning with Hybrid Action Space

Authors: Hongzhi Hua, Guixuan Wen, Kaigui Wu

Abstract: The research of extending deep reinforcement learning (DRL) to the multi-agent field has solved many complicated problems and made great achievements. However, almost all of these studies focus on only discrete or only continuous action spaces, and few works have applied multi-agent deep reinforcement learning to real-world environment problems, which mostly have a hybrid action space. Therefore, in this paper, we propose two algorithms: deep multi-agent hybrid soft actor-critic (MAHSAC) and multi-agent hybrid deep deterministic policy gradients (MAHDDPG) to fill this gap. These two algorithms follow the centralized training and decentralized execution (CTDE) paradigm and can handle hybrid action space problems. Our experiments are run on the multi-agent particle environment, a simple multi-agent particle world with some basic simulated physics. The experimental results show that these algorithms perform well.

Submitted 30 August, 2022; originally announced August 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2206.05108

arXiv:2206.05658 [pdf, other]

doi 10.1109/TNNLS.2023.3330926

Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization

Authors: Hang Hua, Xingjian Li, Dejing Dou, Cheng-Zhong Xu, Jiebo Luo

Abstract: The advent of large-scale pre-trained language models has contributed greatly to the recent progress in natural language processing. Many state-of-the-art language models are first trained on a large text corpus and then fine-tuned on downstream tasks. Despite its recent success and wide adoption, fine-tuning a pre-trained language model often suffers from overfitting, which leads to poor generalizability due to the extremely high complexity of the model and the limited training samples from downstream tasks. To address this problem, we propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR). Specifically, we propose to inject the standard Gaussian noise or In-manifold noise and regularize hidden representations of the fine-tuned model. We first provide theoretical analyses to support the efficacy of our method. We then demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART. While these previous works only verify the effectiveness of their methods on relatively simple text classification tasks, we also verify the effectiveness of our method on question answering tasks, where the target problem is much more difficult and more training examples are available. Furthermore, extensive experimental results indicate that the proposed algorithm can not only enhance the in-domain performance of the language models but also improve the domain generalization performance on out-of-domain data.

Submitted 8 November, 2023; v1 submitted 12 June, 2022; originally announced June 2022.

Comments: Accepted by TNNLS

arXiv:2206.05108 [pdf, ps, other]

Deep Multi-Agent Reinforcement Learning with Hybrid Action Spaces based on Maximum Entropy

Authors: Hongzhi Hua, Kaigui Wu, Guixuan Wen

Abstract: Multi-agent deep reinforcement learning has been applied to address a variety of complex problems with either discrete or continuous action spaces and achieved great success. However, most real-world environments cannot be described by only discrete action spaces or only continuous action spaces, and few works have applied deep reinforcement learning (DRL) to multi-agent problems with hybrid action spaces. Therefore, we propose a novel algorithm: Deep Multi-Agent Hybrid Soft Actor-Critic (MAHSAC) to fill this gap. This algorithm follows the centralized training but decentralized execution (CTDE) paradigm, and extends the Soft Actor-Critic algorithm (SAC) to handle hybrid action space problems in multi-agent environments based on maximum entropy. Our experiments are run on a simple multi-agent particle world with a continuous observation and discrete action space, along with some basic simulated physics. The experimental results show that MAHSAC has good performance in training speed, stability, and anti-interference ability. At the same time, it outperforms the existing independent deep hybrid learning method in cooperative and competitive scenarios.

Submitted 10 June, 2022; originally announced June 2022.

arXiv:2205.14050 [pdf, other]

MIMO Integrated Sensing and Communication with Extended Target: CRB-Rate Tradeoff

Authors: Haocheng Hua, Xianxin Song, Yuan Fang, Tony Xiao Han, Jie Xu

Abstract: This paper studies a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system, in which a multi-antenna base station (BS) sends unified wireless signals to estimate an extended target and communicate with a multi-antenna communication user (CU) at the same time. We investigate the fundamental tradeoff between the estimation Cramér-Rao bound (CRB) for sensing and the data rate for communication, by characterizing the Pareto boundary of the achievable CRB-rate (C-R) region. Towards this end, we formulate a new MIMO rate maximization problem by optimizing the transmit covariance matrix at the BS, subject to a new form of maximum CRB constraint together with a maximum transmit power constraint. We derive the optimal transmit covariance solution in a semi-closed form, by first implementing the singular-value decomposition (SVD) to diagonalize the communication channel and then properly allocating the transmit power over these subchannels for communication and other orthogonal subchannels (if any) for dedicated sensing. It is shown that the optimal transmit covariance is of full rank, which unifies the conventional rate maximization design with water-filling power allocation and the CRB minimization design with isotropic transmission. Numerical results are provided to validate the performance achieved by our proposed optimal design, in comparison with other benchmark schemes.

Submitted 17 August, 2022; v1 submitted 27 May, 2022; originally announced May 2022.

arXiv:2201.12567 [pdf, other]

The HCCL-DKU system for fake audio generation task of the 2022 ICASSP ADD Challenge

Authors: Ziyi Chen, Hua Hua, Yuxiang Zhang, Ming Li, Pengyuan Zhang

Abstract: The voice conversion task is to modify the speaker identity of continuous speech while preserving the linguistic content. Generally, the naturalness and similarity are two main metrics for evaluating the conversion quality, which has been improved significantly in recent years. This paper presents the HCCL-DKU entry for the fake audio generation task of the 2022 ICASSP ADD challenge. We propose a novel ppg-based voice conversion model that adopts a fully end-to-end structure. Experimental results show that the proposed method outperforms other conversion models, including Tacotron-based and Fastspeech-based models, on conversion quality and spoofing performance against anti-spoofing systems. In addition, we investigate several post-processing methods for better spoofing power. Finally, we achieve second place with a deception success rate of 0.916 in the ADD challenge.

Submitted 29 January, 2022; originally announced January 2022.

arXiv:2107.04835 [pdf, other]

Noise Stability Regularization for Improving BERT Fine-tuning

Authors: Hang Hua, Xingjian Li, Dejing Dou, Cheng-Zhong Xu, Jiebo Luo

Abstract: Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP tasks. Despite its recent success and wide adoption, this process is unstable when there are only a small number of training samples available. The brittleness of this process is often reflected by the sensitivity to random seeds. In this paper, we propose to tackle this problem based on the noise stability property of deep nets, which is investigated in recent literature (Arora et al., 2018; Sanyal et al., 2020). Specifically, we introduce a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR). We extend the theories about adding noise to the input and prove that our method gives a stabler regularization effect. We provide supportive evidence by experimentally confirming that well-performing models show a low sensitivity to noise and fine-tuning with LNSR exhibits clearly higher generalizability and stability. Furthermore, our method also demonstrates advantages over other state-of-the-art algorithms including L2-SP (Li et al., 2018), Mixout (Lee et al., 2020) and SMART (Jiang et al., 2020).

Submitted 10 July, 2021; originally announced July 2021.

Comments: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

arXiv:2104.11871 [pdf, other]

Optimal Transmit Beamforming for Integrated Sensing and Communication

Authors: Haocheng Hua, Jie Xu, Tony Xiao Han

Abstract: This paper studies the transmit beamforming in a downlink integrated sensing and communication (ISAC) system, where a base station (BS) equipped with a uniform linear array (ULA) sends combined information-bearing and dedicated radar signals to simultaneously perform downlink multiuser communication and radar target sensing. Under this setup, we maximize the radar sensing performance (in terms of minimizing the beampattern matching errors or maximizing the minimum weighted beampattern gains), subject to the communication users' minimum signal-to-interference-plus-noise ratio (SINR) requirements and the BS's transmit power constraints. In particular, we consider two types of communication receivers, namely Type-I and Type-II receivers, which do not have and do have the capability of cancelling the interference from the a-priori known dedicated radar signals, respectively. Under both Type-I and Type-II receivers, the beampattern matching and minimum weighted beampattern gain maximization problems are globally optimally solved via applying the semidefinite relaxation (SDR) technique together with the rigorous proof of the tightness of SDR for both Type-I and Type-II receivers under the two design criteria. It is shown that at the optimality, radar signals are not required with Type-I receivers under some specific conditions, while radar signals are always needed to enhance the performance with Type-II receivers. Numerical results show that the minimum weighted beampattern gain maximization leads to significantly higher beampattern gains at the worst-case sensing angles with a much lower computational complexity than the beampattern matching design. We show that by exploiting the capability of canceling the interference caused by the radar signals, the case with Type-II receivers results in better sensing performance than that with Type-I receivers and other conventional designs.

Submitted 24 March, 2023; v1 submitted 23 April, 2021; originally announced April 2021.

Comments: Accepted by IEEE Transactions on Vehicular Technology

arXiv:2006.00163 [pdf, ps, other]

Tracking Public Opinion in China through Various Stages of the COVID-19 Pandemic

Authors: Yuqi Gao, Hang Hua, Jiebo Luo

Abstract: In recent months, COVID-19 has become a global pandemic and had a huge impact on the world. People under different conditions have very different attitudes toward the epidemic. Due to the real-time and large-scale nature of social media, we can continuously obtain a massive amount of public opinion information related to the epidemic from social media. In particular, researchers may ask questions such as "how is the public reacting to COVID-19 in China during different stages of the pandemic?", "what factors affect the public opinion orientation in China?", and so on. To answer such questions, we analyze the pandemic related public opinion information on Weibo, China's largest social media platform. Specifically, we have first collected a large amount of COVID-19-related public opinion microblogs. We then use a sentiment classifier to recognize and analyze different groups of users' opinions. In the collected sentiment orientated microblogs, we try to track the public opinion through different stages of the COVID-19 pandemic. Furthermore, we analyze more key factors that might have an impact on the public opinion of COVID-19 (e.g., users in different provinces or users with different education levels). Empirical results show that the public opinions vary along with the key factors of COVID-19. Furthermore, we analyze the public attitudes on different public-concerning topics, such as staying at home and quarantine.

Submitted 1 June, 2020; v1 submitted 29 May, 2020; originally announced June 2020.

arXiv:1907.02506 [pdf]

doi 10.1016/j.future.2019.03.011

Security modeling and efficient computation offloading for service workflow in mobile edge computing

Authors: Binbin Huang, Zhongjin Lia, Peng Tang, Shangguang Wang, Jun Zhao, Haiyang Hua, Wanqing Lia, Victor Chang

Abstract: It is a big challenge for resource-limited mobile devices (MDs) to execute various complex and energy-consumed mobile applications. Fortunately, as a novel computing paradigm, edge computing (MEC) can provide abundant computing resources to execute all or parts of the tasks of MDs and thereby can greatly reduce the energy of MD and improve the QoS of applications. However, offloading workflow tasks to the MEC servers are liable to external security threats (e.g., snooping, alteration). In this paper, we propose a security and energy efficient computation offloading (SEECO) strategy for service workflows in MEC environment, the goal of which is to optimize the energy consumption under the risk probability and deadline constraints. First, we build a security overhead model to measure the execution time of security services. Then, we formulate the computation offloading problem by incorporating the security, energy consumption and execution time of workflow application. Finally, based on the genetic algorithm (GA), the corresponding coding strategies of SEECO are devised by considering tasks execution order and location and security services selection. Extensive experiments with the variety of workflow parameters demonstrate that SEECO strategy can achieve the security and energy efficiency for the mobile applications.

Submitted 4 July, 2019; originally announced July 2019.

Comments: published in journal "Future Generation Computer Systems": https://doi.org/10.1016/j.future.2019.03.011

MSC Class: mobile edge computing; workflow scheduling; security modeling; energy efficient; genetic algorithm (GA)

arXiv:1905.12926 [pdf, other]

Controllable Unsupervised Text Attribute Transfer via Editing Entangled Latent Representation

Authors: Ke Wang, Hang Hua, Xiaojun Wan

Abstract: Unsupervised text attribute transfer automatically transforms a text to alter a specific attribute (e.g. sentiment) without using any parallel data, while simultaneously preserving its attribute-independent content. The dominant approaches are trying to model the content-independent attribute separately, e.g., learning different attributes' representations or using multiple attribute-specific decoders. However, it may lead to inflexibility from the perspective of controlling the degree of transfer or transferring over multiple aspects at the same time. To address the above problems, we propose a more flexible unsupervised text attribute transfer framework which replaces the process of modeling attribute with minimal editing of latent representations based on an attribute classifier. Specifically, we first propose a Transformer-based autoencoder to learn an entangled latent representation for a discrete text, then we transform the attribute transfer task to an optimization problem and propose the Fast-Gradient-Iterative-Modification algorithm to edit the latent representation until conforming to the target attribute. Extensive experimental results demonstrate that our model achieves very competitive performance on three public data sets. Furthermore, we also show that our model can not only control the degree of transfer freely but also allow to transfer over multiple aspects at the same time.

Submitted 12 December, 2019; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: NeurIPS 2019 camera ready

arXiv:1807.10935 [pdf, other]

Towards Explainable Inference about Object Motion using Qualitative Reasoning

Authors: Xiaoyu Ge, Jochen Renz, Hua Hua

Abstract: The capability of making explainable inferences regarding physical processes has long been desired. One fundamental physical process is object motion. Inferring what causes the motion of a group of objects can even be a challenging task for experts, e.g., in forensics science. Most of the work in the literature relies on physics simulation to draw such inferences. The simulation requires a precise model of the underlying domain to work well and is essentially a black-box from which one can hardly obtain any useful explanation. By contrast, qualitative reasoning methods have the advantage in making transparent inferences with ambiguous information, which makes it suitable for this task. However, there has been no suitable qualitative theory proposed for object motion in three-dimensional space. In this paper, we take this challenge and develop a qualitative theory for the motion of rigid objects. Based on this theory, we develop a reasoning method to solve a very interesting problem: Assuming there are several objects that were initially at rest and now have started to move. We want to infer what action causes the movement of these objects.

Submitted 28 July, 2018; originally announced July 2018.

arXiv:1610.03706 [pdf]

Bibliometric Index for Academic Leadership

Authors: Yang Liu, Fengrong Ou, Yan Deng, Bo Wu, Ruxi Liu, Hui Hua, Yuyuan Guan, Rentong Chen, Lars Gjesteby, Jiansheng Yang, Michael Vannier, Ge Wang

Abstract: Academic leadership is essential for research innovation and impact. Until now, there has been no dedicated measure of leadership by bibliometrics. Popular bibliometric indices are mainly based on academic output, such as the journal impact factor and the number of citations. Here we develop an academic leadership index based on readily available bibliometric data that is sensitive to not only academic output but also research efficiency. Our leadership index was tested in two studies on peer-reviewed journal papers by extramurally-funded principal investigators in the field of life sciences from China and the USA, respectively. The leadership performance of these principal investigators was quantified and compared relative to university rank and other factors. As a validation measure, we show that the highest average leadership index was achieved by principal investigators at top national universities in both countries. More interestingly, our results also indicate that on an individual basis, strong leadership and high efficiency are not necessarily associated with those at top-tier universities nor with the most funding. This leadership index may become the basis of a comprehensive merit system, facilitating academic evaluation and resource management.

Submitted 12 October, 2016; originally announced October 2016.

Comments: 25 pages, 4 figures, 4 tables, 33 references

Showing 1–31 of 31 results for author: Hua, H