Showing 1–50 of 185 results for author: He, P

Searching in archive cs.
  1. arXiv:2408.10327

    cs.SE

    An Empirical Study on Package-Level Deprecation in Python Ecosystem

    Authors: Zhiqing Zhong, Shilin He, Haoxuan Wang, Boxi Yu, Haowen Yang, Pinjia He

    Abstract: Open-source software (OSS) plays a crucial role in modern software development. Utilizing OSS code can greatly accelerate software development, reduce redundancy, and enhance reliability. Python, a widely adopted programming language, is renowned for its extensive and diverse third-party package ecosystem. However, a significant number of OSS packages within the Python ecosystem are in poor mainte…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: Accepted by 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE'25)

  2. arXiv:2408.05760

    cs.SE

    Unlocking the Power of Numbers: Log Compression via Numeric Token Parsing

    Authors: Siyu Yu, Yifan Wu, Ying Li, Pinjia He

    Abstract: Parser-based log compressors have been widely explored in recent years because the explosive growth of log volumes makes the compression performance of general-purpose compressors unsatisfactory. These parser-based compressors preprocess logs by grouping the logs based on the parsing result and then feed the preprocessed files into a general-purpose compressor. However, parser-based compressors ha…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: Accepted by 2024 39th IEEE/ACM International Conference on Automated Software Engineering (ASE'24)

  3. arXiv:2408.02911

    cs.OS

    NVPC: A Transparent NVM Page Cache

    Authors: Guoyu Wang, Xilong Che, Haoyang Wei, Shuo Chen, Puyi He, Juncheng Hu

    Abstract: Towards a compatible utilization of NVM, NVM-specialized kernel file systems and NVM-based disk file system accelerators have been proposed. However, these studies only focus on one or several characteristics of NVM, while failing to exploit its best practice by putting NVM in the proper position of the whole storage stack. In this paper, we present NVPC, a transparent acceleration to existing ker…

    Submitted 5 August, 2024; originally announced August 2024.

  4. arXiv:2407.09121

    cs.CL cs.AI

    Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

    Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu

    Abstract: This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at a…

    Submitted 12 July, 2024; originally announced July 2024.

  5. arXiv:2407.07304

    cs.AI

    Inference Performance Optimization for Large Language Models on CPUs

    Authors: Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

    Abstract: Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate constraints imposed by hardw…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: 5 pages, 6 figures, ICML 2024 on Foundation Models in the Wild

  6. arXiv:2407.05739

    cs.NE cs.AI

    Multi-Bit Mechanism: A Novel Information Transmission Paradigm for Spiking Neural Networks

    Authors: Yongjun Xiao, Xianlong Tian, Yongqi Ding, Pei He, Mengmeng Jing, Lin Zuo

    Abstract: Since they were proposed, spiking neural networks (SNNs) have gained recognition for their high performance, low power consumption and enhanced biological interpretability. However, while bringing these advantages, the binary nature of spikes also leads to considerable information loss in SNNs, ultimately causing performance degradation. We claim that the limited expressiveness of current binary spikes, resulting…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Under review

  7. arXiv:2407.00029

    cs.DC

    Distributed Inference Performance Optimization for LLMs on CPUs

    Authors: Pujiang He, Shan Zhou, Changqing Li, Wenhuan Huang, Weifei Yu, Duyi Wang, Chen Meng, Sheng Gui

    Abstract: Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs onto a resource-limited hardware device with restricted memory capacity presents considerable challenges. Distributed computing emerges as a prevalent strategy to mitigate single-node memory constraints and ex…

    Submitted 16 May, 2024; originally announced July 2024.

    Comments: 4 pages, 3 figures, Practical ML for Low Resource Settings Workshop @ ICLR 2024

  8. arXiv:2406.14773

    cs.CR

    Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data

    Authors: Shenglai Zeng, Jiankun Zhang, Pengfei He, Jie Ren, Tianqi Zheng, Hanqing Lu, Han Xu, Hui Liu, Yue Xing, Jiliang Tang

    Abstract: Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. However, when the retrieval process involves private data, RAG systems may face severe privacy risks, potentially leading to the leakage of sensitive information. To address this issue, we propose using synthetic data as a privacy-preserving al…

    Submitted 20 June, 2024; originally announced June 2024.

  9. arXiv:2406.11645

    cs.HC cs.CV

    SeamPose: Repurposing Seams as Capacitive Sensors in a Shirt for Upper-Body Pose Tracking

    Authors: Tianhong Catherine Yu, Manru Mary Zhang, Peter He, Chi-Jung Lee, Cassidy Cheesman, Saif Mahmud, Ruidong Zhang, François Guimbretière, Cheng Zhang

    Abstract: Seams are areas of overlapping fabric formed by stitching two or more pieces of fabric together in the cut-and-sew apparel manufacturing process. In SeamPose, we repurposed seams as capacitive sensors in a shirt for continuous upper-body pose estimation. Compared to previous all-textile motion-capturing garments that place the electrodes on the clothing surface, our solution leverages existing sea…

    Submitted 6 August, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  10. arXiv:2406.10794

    cs.CL

    Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis

    Authors: Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang

    Abstract: Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs to output harmful contents. Although there are diverse jailbreak attack strategies, there is no unified understanding on why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic p…

    Submitted 26 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

  11. arXiv:2405.17229

    cs.HC

    InsigHTable: Insight-driven Hierarchical Table Visualization with Reinforcement Learning

    Authors: Guozheng Li, Peng He, Xinyu Wang, Runfei Li, Chi Harold Liu, Chuangxin Ou, Dong He, Guoren Wang

    Abstract: Embedding visual representations within original hierarchical tables can mitigate additional cognitive load stemming from the division of users' attention. The created hierarchical table visualizations can help users understand and explore complex data with multi-level attributes. However, because of many options available for transforming hierarchical tables and selecting subsets for embedding, t…

    Submitted 27 May, 2024; originally announced May 2024.

  12. arXiv:2405.04513

    cs.CL cs.AI cs.LG

    Switchable Decision: Dynamic Neural Generation Networks

    Authors: Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou

    Abstract: Auto-regressive generation models achieve competitive performance across many different NLP tasks such as summarization, question answering, and classifications. However, they are also known for being slow in inference, which makes them challenging to deploy in real-time applications. We propose a switchable decision to accelerate inference by dynamically assigning computation resources for each d…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Accepted to ICML 2024

  13. arXiv:2405.04133

    cs.CV

    Exposing AI-generated Videos: A Benchmark Dataset and a Local-and-Global Temporal Defect Based Detection Method

    Authors: Peisong He, Leyao Zhu, Jiaxing Li, Shiqi Wang, Haoliang Li

    Abstract: The generative model has made significant advancements in the creation of realistic videos, which causes security issues. However, this emerging risk has not been adequately addressed due to the absence of a benchmark dataset for AI-generated videos. In this paper, we first construct a video dataset using advanced diffusion-based video generation algorithms with various semantic contents. Besides,…

    Submitted 7 May, 2024; originally announced May 2024.

  14. arXiv:2405.03884

    cs.CV

    BadFusion: 2D-Oriented Backdoor Attacks against 3D Object Detection

    Authors: Saket S. Chaturvedi, Lan Zhang, Wenbin Zhang, Pan He, Xiaoyong Yuan

    Abstract: 3D object detection plays an important role in autonomous driving; however, its vulnerability to backdoor attacks has become evident. By injecting "triggers" to poison the training dataset, backdoor attacks manipulate the detector's prediction for inputs containing these triggers. Existing backdoor attacks against 3D object detection primarily poison 3D LiDAR signals, where large-sized 3D trigge…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: Accepted at IJCAI 2024 Conference

  15. arXiv:2405.03489

    cs.SE

    On the Influence of Data Resampling for Deep Learning-Based Log Anomaly Detection: Insights and Recommendations

    Authors: Xiaoxue Ma, Huiqi Zou, Jacky Keung, Pinjia He, Yishu Li, Xiao Yu, Federica Sarro

    Abstract: Numerous DL-based approaches have garnered considerable attention in the field of software Log Anomaly Detection. However, a practical challenge persists: the class imbalance in the public data commonly used to train the DL models. This imbalance is characterized by a substantial disparity in the number of abnormal log sequences compared to normal ones, for example, anomalies represent less than 1…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: 15 pages, 2 figures

  16. arXiv:2404.15819

    cs.AR

    APACHE: A Processing-Near-Memory Architecture for Multi-Scheme Fully Homomorphic Encryption

    Authors: Lin Ding, Song Bian, Penggao He, Yan Xu, Gang Qu, Jiliang Zhang

    Abstract: Fully Homomorphic Encryption (FHE) allows one to outsource computation over encrypted data to untrusted servers without worrying about data breaching. Since FHE is known to be extremely computationally-intensive, application-specific accelerators emerged as a powerful solution to narrow the performance gap. Nonetheless, due to the increasing complexities in FHE schemes per se and multi-scheme FHE…

    Submitted 24 April, 2024; originally announced April 2024.

  17. arXiv:2404.08877

    cs.SE cs.CL cs.LG

    Aligning LLMs for FL-free Program Repair

    Authors: Junjielong Xu, Ying Fu, Shin Hwei Tan, Pinjia He

    Abstract: Large language models (LLMs) have achieved decent results on automated program repair (APR). However, the next token prediction training objective of decoder-only LLMs (e.g., GPT-4) is misaligned with the masked span prediction objective of current infilling-style methods, which impedes LLMs from fully leveraging pre-trained knowledge for program repair. In addition, while some LLMs are capable of…

    Submitted 12 April, 2024; originally announced April 2024.

  18. arXiv:2403.17574

    cs.SE cs.DC

    SPES: Towards Optimizing Performance-Resource Trade-Off for Serverless Functions

    Authors: Cheryl Lee, Zhouruixing Zhu, Tianyi Yang, Yintong Huo, Yuxin Su, Pinjia He, Michael R. Lyu

    Abstract: As an emerging cloud computing deployment paradigm, serverless computing is gaining traction due to its efficiency and ability to harness on-demand cloud resources. However, a significant hurdle remains in the form of the cold start problem, causing latency when launching new function instances from scratch. Existing solutions tend to use over-simplistic strategies for function pre-loading/unloadi…

    Submitted 21 August, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: 12 pages, accepted by ICDE 2024 (40th IEEE International Conference on Data Engineering)

  19. arXiv:2403.12389

    cs.NE

    Learning-guided iterated local search for the minmax multiple traveling salesman problem

    Authors: Pengfei He, Jin-Kao Hao, Jinhui Xia

    Abstract: The minmax multiple traveling salesman problem involves minimizing the longest tour among a set of tours. The problem is of great practical interest because it can be used to formulate several real-life applications. To solve this computationally challenging problem, we propose a learning-driven iterated local search approach that combines an aggressive local search procedure with a probabilistic a…

    Submitted 18 March, 2024; originally announced March 2024.

  20. arXiv:2403.09361

    cs.AI

    A Multi-population Integrated Approach for Capacitated Location Routing

    Authors: Pengfei He, Jin-Kao Hao, Qinghua Wu

    Abstract: The capacitated location-routing problem involves determining the depots from a set of candidate capacitated depot locations and finding the required routes from the selected depots to serve a set of customers while minimizing a cost function that includes the cost of opening the chosen depots, the fixed utilization cost per vehicle used, and the total cost (distance) of the routes. This paper p…

    Submitted 14 March, 2024; originally announced March 2024.

  21. arXiv:2403.06884

    cs.CV

    A Holistic Framework Towards Vision-based Traffic Signal Control with Microscopic Simulation

    Authors: Pan He, Quanyi Li, Xiaoyong Yuan, Bolei Zhou

    Abstract: Traffic signal control (TSC) is crucial for reducing traffic congestion that leads to smoother traffic flow, reduced idling time, and mitigated CO2 emissions. In this study, we explore the computer vision approach for TSC that modulates on-road traffic flows through visual observation. Unlike traditional feature-based approaches, vision-based methods depend much less on heuristics and predefined f…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: Under review for IEEE publications

  22. arXiv:2403.04861

    cs.LG cs.NE

    A Survey of Lottery Ticket Hypothesis

    Authors: Bohan Liu, Zijie Zhang, Peixiong He, Zhensen Wang, Yang Xiao, Ruimeng Ye, Yang Zhou, Wei-Shinn Ku, Bo Hui

    Abstract: The Lottery Ticket Hypothesis (LTH) states that a dense neural network model contains a highly sparse subnetwork (i.e., winning tickets) that can achieve even better performance than the original model when trained in isolation. While LTH has been proved both empirically and theoretically in many works, there still are some open issues, such as efficiency and scalability, to be addressed. Also, th…

    Submitted 12 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  23. arXiv:2402.16893

    cs.CR cs.AI cs.CL

    The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG)

    Authors: Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, Jiliang Tang

    Abstract: Retrieval-augmented generation (RAG) is a powerful technique to facilitate language models with proprietary and private data, where data privacy is a pivotal concern. Whereas extensive research has demonstrated the privacy risks of large language models (LLMs), the RAG technique could potentially reshape the inherent behaviors of LLM generation, posing new privacy issues that are currently under-ex…

    Submitted 23 February, 2024; originally announced February 2024.

  24. arXiv:2402.12958

    cs.SE

    Go Static: Contextualized Logging Statement Generation

    Authors: Yichen Li, Yintong Huo, Renyi Zhong, Zhihan Jiang, Jinyang Liu, Junjie Huang, Jiazhen Gu, Pinjia He, Michael R. Lyu

    Abstract: Logging practices have been extensively investigated to assist developers in writing appropriate logging statements for documenting software behaviors. Although numerous automatic logging approaches have been proposed, their performance remains unsatisfactory due to the constraint of the single-method input, without informative programming context outside the method. Specifically, we identify thre…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: This paper was accepted by The ACM International Conference on the Foundations of Software Engineering (FSE 2024)

  25. arXiv:2402.02333

    cs.CR cs.CV cs.LG

    Copyright Protection in Generative AI: A Technical Perspective

    Authors: Jie Ren, Han Xu, Pengfei He, Yingqian Cui, Shenglai Zeng, Jiankun Zhang, Hongzhi Wen, Jiayuan Ding, Pei Huang, Lingjuan Lyu, Hui Liu, Yi Chang, Jiliang Tang

    Abstract: Generative AI has witnessed rapid advancement in recent years, expanding its capabilities to create synthesized content such as text, images, audio, and code. The high fidelity and authenticity of contents generated by these Deep Generative Models (DGMs) have sparked significant copyright concerns. There have been various legal debates on how to effectively safeguard copyrights in DGMs. This wor…

    Submitted 24 July, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

    Comments: 26 pages

  26. arXiv:2402.02160

    cs.CR

    Data Poisoning for In-context Learning

    Authors: Pengfei He, Han Xu, Yue Xing, Hui Liu, Makoto Yamada, Jiliang Tang

    Abstract: In the domain of large language models (LLMs), in-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks, relying on examples rather than retraining or fine-tuning. This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks, an area not yet fully explored. We wonder whether ICL is vulnerable, with adversaries capable of manipula…

    Submitted 27 March, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  27. arXiv:2401.17426

    cs.LG cs.AI stat.ML

    Superiority of Multi-Head Attention in In-Context Linear Regression

    Authors: Yingqian Cui, Jie Ren, Pengfei He, Jiliang Tang, Yue Xing

    Abstract: We present a theoretical analysis of the performance of transformer with softmax attention in in-context learning with linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with single-/multi-head attention, our research centers on comparing their performance. We conduct an exact theoretical analysis to demonstrate that multi-head attention…

    Submitted 30 January, 2024; originally announced January 2024.

  28. arXiv:2401.05986

    cs.SE

    LogPTR: Variable-Aware Log Parsing with Pointer Network

    Authors: Yifan Wu, Bingxu Chai, Siyu Yu, Ying Li, Pinjia He, Wei Jiang, Jianguo Li

    Abstract: Due to the sheer size of software logs, developers rely on automated log analysis. Log parsing, which parses semi-structured logs into a structured format, is a prerequisite of automated log analysis. However, existing log parsers are unsatisfactory when applied in practice because: 1) they ignore categories of variables, and 2) have poor generalization ability. To address the limitations of exist…

    Submitted 11 January, 2024; originally announced January 2024.

  29. arXiv:2401.01912

    cs.CV cs.LG eess.IV

    Shrinking Your TimeStep: Towards Low-Latency Neuromorphic Object Recognition with Spiking Neural Network

    Authors: Yongqi Ding, Lin Zuo, Mengmeng Jing, Pei He, Yongjun Xiao

    Abstract: Neuromorphic object recognition with spiking neural networks (SNNs) is the cornerstone of low-power neuromorphic computing. However, existing SNNs suffer from significant latency, utilizing 10 to 40 timesteps or more, to recognize neuromorphic objects. At low latencies, the performance of existing SNNs is drastically degraded. In this work, we propose the Shrinking SNN (SSNN) to achieve low-latenc…

    Submitted 1 January, 2024; originally announced January 2024.

    Comments: Accepted by AAAI 2024

  30. arXiv:2401.00757

    cs.SE cs.AI cs.CL cs.LO

    A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models

    Authors: Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, Michael R. Lyu

    Abstract: Recent advancements in large language models (LLMs) have propelled Artificial Intelligence (AI) to new heights, enabling breakthroughs in various tasks such as writing assistance, code generation, and machine translation. A significant distinction of advanced LLMs, such as ChatGPT, is their demonstrated ability to "reason." However, evaluating the reasoning ability of LLMs remains a challenge as m…

    Submitted 1 January, 2024; originally announced January 2024.

  31. arXiv:2310.17304

    cs.CR cs.SE

    Static Semantics Reconstruction for Enhancing JavaScript-WebAssembly Multilingual Malware Detection

    Authors: Yifan Xia, Ping He, Xuhong Zhang, Peiyu Liu, Shouling Ji, Wenhai Wang

    Abstract: The emergence of WebAssembly allows attackers to hide the malicious functionalities of JavaScript malware in cross-language interoperations, termed JavaScript-WebAssembly multilingual malware (JWMM). However, existing anti-virus solutions based on static program analysis are still limited to monolingual code. As a result, their detection effectiveness decreases significantly against JWMM. The dete…

    Submitted 19 April, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: Accepted to ESORICS 2023

  32. arXiv:2310.11451

    cs.CL cs.AI cs.LG

    Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective

    Authors: Ming Zhong, Chenxin An, Weizhu Chen, Jiawei Han, Pengcheng He

    Abstract: Large Language Models (LLMs) inherently encode a wealth of knowledge within their parameters through pre-training on extensive corpora. While prior research has delved into operations on these parameters to manipulate the underlying implicit knowledge (encompassing detection, editing, and merging), there remains an ambiguous understanding regarding their transferability across models with varying…

    Submitted 8 May, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  33. arXiv:2310.08659

    cs.CL cs.AI cs.LG

    LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models

    Authors: Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao

    Abstract: Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in the performance on downstream tasks between full fine-tuning and quantization plu…

    Submitted 28 November, 2023; v1 submitted 12 October, 2023; originally announced October 2023.

  34. arXiv:2310.06714

    cs.AI cs.CL cs.LG

    Exploring Memorization in Fine-tuned Language Models

    Authors: Shenglai Zeng, Yaxin Li, Jie Ren, Yiding Liu, Han Xu, Pengfei He, Yue Xing, Shuaiqiang Wang, Jiliang Tang, Dawei Yin

    Abstract: Large language models (LLMs) have shown great capabilities in various tasks but also exhibited memorization of training data, raising tremendous privacy and copyright concerns. While prior works have studied memorization during pre-training, the exploration of memorization during fine-tuning is rather limited. Compared to pre-training, fine-tuning typically involves more sensitive data and diverse…

    Submitted 22 February, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

  35. arXiv:2310.06433

    cs.SE cs.AI cs.CL cs.CV

    Retromorphic Testing: A New Approach to the Test Oracle Problem

    Authors: Boxi Yu, Qiuyang Mang, Qingshuo Guo, Pinjia He

    Abstract: A test oracle serves as a criterion or mechanism to assess the correspondence between software output and the anticipated behavior for a given input set. In automated testing, black-box techniques, known for their non-intrusive nature in test oracle construction, are widely used, including notable methodologies like differential testing and metamorphic testing. Inspired by the mathematical concept…

    Submitted 10 October, 2023; originally announced October 2023.

    ACM Class: D.3.0; I.2.7; I.4.0

  36. arXiv:2310.06389

    cs.CV stat.ML

    Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling

    Authors: Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, Mingyuan Zhou

    Abstract: Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep netwo…

    Submitted 27 June, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

  37. arXiv:2310.05263

    cs.CR

    Confidence-driven Sampling for Backdoor Attacks

    Authors: Pengfei He, Han Xu, Yue Xing, Jie Ren, Yingqian Cui, Shenglai Zeng, Jiliang Tang, Makoto Yamada, Mohammad Sabokrou

    Abstract: Backdoor attacks aim to surreptitiously insert malicious triggers into DNN models, granting unauthorized control during testing scenarios. Existing methods lack robustness against defense strategies and predominantly focus on enhancing trigger stealthiness while randomly selecting poisoned samples. Our research highlights the overlooked drawbacks of random sampling, which make that attack detectab…

    Submitted 8 October, 2023; originally announced October 2023.

  38. arXiv:2310.02401

    cs.CV cs.CR

    FT-Shield: A Watermark Against Unauthorized Fine-tuning in Text-to-Image Diffusion Models

    Authors: Yingqian Cui, Jie Ren, Yuping Lin, Han Xu, Pengfei He, Yue Xing, Lingjuan Lyu, Wenqi Fan, Hui Liu, Jiliang Tang

    Abstract: Text-to-image generative models, especially those based on latent diffusion models (LDMs), have demonstrated outstanding ability in generating high-quality and high-resolution images from textual prompts. With this advancement, various fine-tuning methods have been developed to personalize text-to-image models for specific applications such as artistic style adaptation and human face transfer. How… ▽ More

    Submitted 3 May, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

  39. arXiv:2310.01796

    cs.SE

    LILAC: Log Parsing using LLMs with Adaptive Parsing Cache

    Authors: Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, Michael R. Lyu

    Abstract: Log parsing transforms log messages into structured formats, serving as the prerequisite step for various log analysis tasks. Although a variety of log parsing approaches have been proposed, their performance on complicated log data remains compromised due to the use of human-crafted rules or learning-based models with limited training data. The recent emergence of powerful large language models (…

    Submitted 22 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: This paper was accepted by The ACM International Conference on the Foundations of Software Engineering (FSE 2024)

  40. arXiv:2310.01307

    cs.CL cs.AI cs.LG

    On the Generalization of Training-based ChatGPT Detection Methods

    Authors: Han Xu, Jie Ren, Pengfei He, Shenglai Zeng, Yingqian Cui, Amy Liu, Hui Liu, Jiliang Tang

    Abstract: ChatGPT is one of the most popular language models and achieves amazing performance on various natural language tasks. Consequently, there is also an urgent need to distinguish texts generated by ChatGPT from human-written ones. One of the extensively studied methods trains classification models to distinguish the two. However, existing studies also demonstrate that the trained models may suffer from distrib…

    Submitted 3 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

  41. arXiv:2309.03883

    cs.CL cs.AI cs.LG

    DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

    Authors: Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He

    Abstract: Despite their impressive capabilities, large language models (LLMs) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs that does not require conditioning on retrieved external knowledge nor additional fine-tuning. Our approach obtains the next-token distributi…

    Submitted 10 March, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: ICLR 2024 main conference paper. The source code is available at https://github.com/voidism/DoLa

  42. arXiv:2309.02632

    cs.LG cs.AI

    Deep Reinforcement Learning from Hierarchical Preference Design

    Authors: Alexander Bukharin, Yixiao Li, Pengcheng He, Tuo Zhao

    Abstract: Reward design is a fundamental, yet challenging aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scale and intricate dependencies of the feedback signals. This paper shows that, by exploiting certain structures, one can ease the reward design process. Spec…

    Submitted 10 June, 2024; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: 28 Pages, 14 figures

  43. arXiv:2309.01866  [pdf, other]

    cs.CR cs.AI cs.LG cs.SE

    Efficient Query-Based Attack against ML-Based Android Malware Detection under Zero Knowledge Setting

    Authors: Ping He, Yifan Xia, Xuhong Zhang, Shouling Ji

    Abstract: The widespread adoption of the Android operating system has made malicious Android applications an appealing target for attackers. Machine learning-based (ML-based) Android malware detection (AMD) methods are crucial in addressing this problem; however, their vulnerability to adversarial examples raises concerns. Current attacks against ML-based AMD methods demonstrate remarkable performance but r…

    Submitted 6 September, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: To Appear in the ACM Conference on Computer and Communications Security, November, 2023

  44. arXiv:2308.12439  [pdf, other]

    cs.CR cs.AI cs.CV cs.LG

    BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

    Authors: Tinghao Xie, Xiangyu Qi, Ping He, Yiming Li, Jiachen T. Wang, Prateek Mittal

    Abstract: We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. The proposed defense is built upon a novel reverse engineering approach that can directly extract backdoor fu…

    Submitted 5 October, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

  45. arXiv:2308.12433  [pdf, other]

    cs.CV

    A Spatiotemporal Correspondence Approach to Unsupervised LiDAR Segmentation with Traffic Applications

    Authors: Xiao Li, Pan He, Aotian Wu, Sanjay Ranka, Anand Rangarajan

    Abstract: We address the problem of unsupervised semantic segmentation of outdoor LiDAR point clouds in diverse traffic scenarios. The key idea is to leverage the spatiotemporal nature of a dynamic point cloud sequence and introduce drastically stronger augmentation by establishing spatiotemporal correspondences across multiple frames. We dovetail clustering and pseudo-label learning in this work. Essential…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: Accepted for publication in IEEE International Conference on Intelligent Transportation Systems (ITSC 2023)

  46. arXiv:2308.10819  [pdf, other]

    cs.CL cs.AI

    Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection

    Authors: Zekun Li, Baolin Peng, Pengcheng He, Xifeng Yan

    Abstract: Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following, becoming increasingly crucial across various applications. However, this capability brings with it the risk of prompt injection attacks, where attackers inject instructions into LLMs' input to elicit undesirable actions or content. Understanding the robustness of LLMs against such attacks is vital for…

    Submitted 24 November, 2023; v1 submitted 17 August, 2023; originally announced August 2023.

    Comments: The data and code can be found at https://github.com/Leezekun/instruction-following-robustness-eval

  47. arXiv:2308.09810  [pdf, other]

    cs.SE cs.AI cs.CL cs.CV

    An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software

    Authors: Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu, Pinjia He, Michael R. Lyu

    Abstract: The exponential growth of social media platforms has brought about a revolution in communication and content dissemination in human society. Nevertheless, these platforms are being increasingly misused to spread toxic content, including hate speech, malicious advertising, and pornography, leading to severe negative consequences such as harm to teenagers' mental health. Despite tremendous efforts i…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted by ASE 2023. arXiv admin note: substantial text overlap with arXiv:2302.05706

  48. arXiv:2308.09324  [pdf, other]

    cs.SE

    AutoLog: A Log Sequence Synthesis Framework for Anomaly Detection

    Authors: Yintong Huo, Yichen Li, Yuxin Su, Pinjia He, Zifan Xie, Michael R. Lyu

    Abstract: The rapid progress of modern computing systems has led to a growing interest in informative run-time logs. Various log-based anomaly detection techniques have been proposed to ensure software reliability. However, their implementation in the industry has been limited due to the lack of high-quality public log resources as training datasets. While some log datasets are available for anomaly detec…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: The paper has been accepted by ASE 2023 (Research Track)

  49. arXiv:2308.07937  [pdf, other]

    cs.CL cs.SE

    Automated Testing and Improvement of Named Entity Recognition Systems

    Authors: Boxi Yu, Yiyan Hu, Qiuyang Mang, Wenhan Hu, Pinjia He

    Abstract: Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain cir…

    Submitted 13 August, 2023; originally announced August 2023.

    Comments: Accepted by ESEC/FSE'23

    ACM Class: D.2.5; I.2.7

  50. arXiv:2308.07085  [pdf, other]

    cs.SE

    Hue: A User-Adaptive Parser for Hybrid Logs

    Authors: Junjielong Xu, Qiuai Fu, Zhouruixing Zhu, Yutong Cheng, Zhijing Li, Yuchi Ma, Pinjia He

    Abstract: Log parsing, which extracts log templates from semi-structured logs and produces structured logs, is the first and the most critical step in automated log analysis. While existing log parsers have achieved decent results, they suffer from two major limitations by design. First, they do not natively support hybrid logs that consist of both single-line logs and multi-line logs (e.g., Java Exception an…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted by ESEC/FSE 2023