doi 10.1109/ICECT61618.2024.10581284

Trajectory Data Mining and Trip Travel Time Prediction on Specific Roads

Authors: Muhammad Awais Amin, Jawad-Ur-Rehman Chughtai, Waqar Ahmad, Waqas Haider Bangyal, Irfan Ul Haq

Abstract: Predicting a trip's travel time is essential for route planning and navigation applications. The majority of research is based on international data that does not apply to Pakistan's road conditions. We designed a complete pipeline for mining trajectories from sensor data. On this data, we employed state-of-the-art approaches, including a shallow artificial neural network, a deep multi-layered perceptron, and a long short-term memory network, to explore the issue of travel time prediction on frequent routes. The experimental results demonstrate an average prediction error ranging from 30 seconds to 1.2 minutes on trips lasting 10 minutes to 60 minutes on the six most frequent routes in regions of Islamabad, Pakistan.

Submitted 9 July, 2024; originally announced July 2024.

Comments: N/A

arXiv:2404.03188 [pdf]

Classification of Nasopharyngeal Cases using DenseNet Deep Learning Architecture

Authors: W. S. H. M. W. Ahmad, M. F. A. Fauzi, M. K. Abdullahi, Jenny T. H. Lee, N. S. A. Basry, A Yahaya, A. M. Ismail, A. Adam, Elaine W. L. Chan, F. S. Abas

Abstract: Nasopharyngeal carcinoma (NPC) is one of the understudied yet deadliest cancers in South East Asia. In Malaysia, the prevalence is identified mainly in Sarawak, among the Bidayuh ethnic group. NPC is often diagnosed late because it is asymptomatic at the early stage. There are several tissue representations from the nasopharynx biopsy, such as nasopharyngeal inflammation (NPI), lymphoid hyperplasia (LHP), nasopharyngeal carcinoma (NPC) and normal tissue. This paper is our first initiative to identify the difference between NPC, NPI and normal cases. Seven whole slide images (WSIs) with gigapixel resolutions from seven different patients and two hospitals were experimented with using two test setups, each consisting of a different set of images. The tissue regions are patched into smaller blocks and classified using DenseNet architecture with 21 dense layers. Two tests are carried out, one as a proof of concept (Test 1) and one as a real-test scenario (Test 2). The accuracy achieved for the NPC class is 94.8% for Test 1 and 67.0% for Test 2.

Submitted 4 April, 2024; originally announced April 2024.

Comments: This article has been accepted in the Journal of Engineering Science and Technology (JESTEC) and awaiting publication

arXiv:2403.15952 [pdf, other]

IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Authors: Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar

Abstract: The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT4V, the best performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro in the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.

Submitted 9 August, 2024; v1 submitted 23 March, 2024; originally announced March 2024.

arXiv:2403.10059 [pdf, other]

Repoformer: Selective Retrieval for Repository-Level Code Completion

Authors: Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, Xiaofei Ma

Abstract: Recent advances in retrieval-augmented generation (RAG) have initiated a new era in repository-level code completion. However, the invariable use of retrieval in existing methods exposes issues in both efficiency and robustness, with a large proportion of the retrieved contexts proving unhelpful or harmful to code language models (code LMs). In this paper, we propose a selective RAG framework to avoid retrieval when unnecessary. To power this framework, we design a self-supervised learning approach to enable a code LM to accurately self-evaluate whether retrieval can improve its output quality and robustly leverage the potentially noisy retrieved contexts. Using this LM as both the selective RAG policy and the generation model, our framework achieves state-of-the-art repository-level code completion performance on diverse benchmarks including RepoEval, CrossCodeEval, and CrossCodeLongEval, a new long-form code completion benchmark. Meanwhile, our analyses show that selectively retrieving brings as much as 70% inference speedup in the online serving setting without harming the performance. We further demonstrate that our framework is able to accommodate different generation models, retrievers, and programming languages. These advancements position our framework as an important step towards more accurate and efficient repository-level code completion.

Submitted 4 June, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: ICML 2024

arXiv:2402.14052 [pdf, other]

On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation

Authors: Di Wu, Wasi Uddin Ahmad, Kai-Wei Chang

Abstract: This study addresses the application of encoder-only Pre-trained Language Models (PLMs) in keyphrase generation (KPG) amidst the broader availability of domain-tailored encoder-only models compared to encoder-decoder models. We investigate three core inquiries: (1) the efficacy of encoder-only PLMs in KPG, (2) optimal architectural decisions for employing encoder-only PLMs in KPG, and (3) a performance comparison between in-domain encoder-only and encoder-decoder PLMs across varied resource settings. Our findings, derived from extensive experimentation in two domains, reveal that with encoder-only PLMs, although KPE with Conditional Random Fields slightly excels in identifying present keyphrases, the KPG formulation renders a broader spectrum of keyphrase predictions. Additionally, prefix-LM fine-tuning of encoder-only PLMs emerges as a strong and data-efficient strategy for KPG, outperforming general-domain seq2seq PLMs. We also identify a favorable parameter allocation towards model depth rather than width when employing encoder-decoder architectures initialized with encoder-only PLMs. The study sheds light on the potential of utilizing encoder-only PLMs for advancing KPG systems and provides a groundwork for future KPG methods. Our code and pre-trained checkpoints are released at https://github.com/uclanlp/DeepKPG.

Submitted 21 February, 2024; originally announced February 2024.

Comments: LREC-COLING 2024 camera ready. arXiv admin note: text overlap with arXiv:2212.10233

arXiv:2402.01935 [pdf, other]

Code Representation Learning At Scale

Authors: Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang

Abstract: Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks by large margins. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts the cross-lingual semantic search performance; and (iv) how the pretraining schemes determine the way downstream task performance scales with the model size.

Submitted 2 February, 2024; originally announced February 2024.

Comments: 10 pages

Journal ref: ICLR 2024

arXiv:2401.08386 [pdf, other]

Deep Learning-based Group Causal Inference in Multivariate Time-series

Authors: Wasim Ahmad, Maha Shadaydeh, Joachim Denzler

Abstract: Causal inference in a nonlinear system of multivariate time series is instrumental in disentangling the intricate web of relationships among variables, enabling us to make more accurate predictions and gain deeper insights into real-world complex systems. Causality methods typically identify the causal structure of a multivariate system by considering the cause-effect relationship of each pair of variables while ignoring the collective effect of a group of variables or interactions involving more than two time series variables. In this work, we test model invariance by group-level interventions on the trained deep networks to infer causal direction in groups of variables, such as climate and ecosystem, brain networks, etc. Extensive testing with synthetic and real-world time series data shows a significant improvement of our method over other applied group causality methods and provides insights into real-world time series. The code for our method can be found at: https://github.com/wasimahmadpk/gCause.

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted in AAAI24 (AI4TS)

arXiv:2311.03782

CapST: An Enhanced and Lightweight Model Attribution Approach for Synthetic Videos

Authors: Wasim Ahmad, Yan-Tsung Peng, Yuan-Hao Chang, Gaddisa Olani Ganfure, Sarwar Khan, Sahibzada Adil Shahzad

Abstract: Deepfake videos, generated through AI faceswapping techniques, have garnered considerable attention due to their potential for powerful impersonation attacks. While existing research primarily focuses on binary classification to discern between real and fake videos, determining the specific generation model for a fake video is crucial for forensic investigation. Addressing this gap, this paper investigates the model attribution problem of Deepfake videos from a recently proposed dataset, Deepfakes from Different Models (DFDM), derived from various Autoencoder models. The dataset comprises 6,450 Deepfake videos generated by five distinct models with variations in encoder, decoder, intermediate layer, input resolution, and compression ratio. This study formulates Deepfakes model attribution as a multiclass classification task, proposing a segment of VGG19 as a feature extraction backbone, known for its effectiveness in image-related tasks, integrated with a Capsule Network with a Spatio-Temporal attention mechanism. The Capsule module captures intricate hierarchies among features for robust identification of deepfake attributes. Additionally, the video-level fusion technique leverages temporal attention mechanisms to handle concatenated feature vectors, capitalizing on inherent temporal dependencies in deepfake videos. By aggregating insights across frames, our model gains a comprehensive understanding of video content, resulting in more precise predictions. Experimental results on the deepfake benchmark dataset (DFDM) demonstrate the efficacy of our proposed method, achieving up to a 4% improvement in accurately categorizing deepfake videos compared to baseline models while demanding fewer computational resources.

Submitted 22 January, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

Comments: Rejected from journal and will have to conduct several more experiments

arXiv:2310.11248 [pdf, other]

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

Authors: Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang

Abstract: Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing the real-world software development scenario where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill in this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding this context into the prompt. However, despite such improvements, the pinnacle of performance remains notably unattained even with the highest-performing model, indicating that CrossCodeEval is also capable of assessing a model's capability in leveraging extensive context to produce better code completions. Finally, we benchmarked various methods in retrieving cross-file context, and show that CrossCodeEval can also be used to measure the capability of code retrievers.

Submitted 16 November, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: To appear at NeurIPS 2023 (Datasets and Benchmarks Track)

arXiv:2310.06374 [pdf, other]

Rethinking Model Selection and Decoding for Keyphrase Generation with Pre-trained Sequence-to-Sequence Models

Authors: Di Wu, Wasi Uddin Ahmad, Kai-Wei Chang

Abstract: Keyphrase Generation (KPG) is a longstanding task in NLP with widespread applications. The advent of sequence-to-sequence (seq2seq) pre-trained language models (PLMs) has ushered in a transformative era for KPG, yielding promising performance improvements. However, many design decisions remain unexplored and are often made arbitrarily. This paper undertakes a systematic analysis of the influence of model selection and decoding strategies on PLM-based KPG. We begin by elucidating why seq2seq PLMs are apt for KPG, anchored by an attention-driven hypothesis. We then establish that conventional wisdom for selecting seq2seq PLMs lacks depth: (1) merely increasing model size or performing task-specific adaptation is not parameter-efficient; (2) although combining in-domain pre-training with task adaptation benefits KPG, it does partially hinder generalization. Regarding decoding, we demonstrate that while greedy search achieves strong F1 scores, it lags in recall compared with sampling-based methods. Based on these insights, we propose DeSel, a likelihood-based decode-select algorithm for seq2seq PLMs. DeSel improves greedy search by an average of 4.7% semantic F1 across five datasets. Our collective findings pave the way for deeper future investigations into PLM-based KPG.

Submitted 22 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 camera ready

arXiv:2307.05790 [pdf]

doi 10.1109/IPDPSW.2017.22

Design of an energy aware petaflops class high performance cluster based on power architecture

Authors: W. A. Ahmad, A. Bartolini, F. Beneventi, L. Benini, A. Borghesi, M. Cicala, P. Forestieri, C. Gianfreda, D. Gregori, A. Libri, F. Spiga, S. Tinti

Abstract: In this paper we present D.A.V.I.D.E. (Development for an Added Value Infrastructure Designed in Europe), an innovative and energy efficient High Performance Computing cluster designed by E4 Computer Engineering for PRACE (Partnership for Advanced Computing in Europe). D.A.V.I.D.E. is built using best-in-class components (IBM's POWER8-NVLink CPUs, NVIDIA TESLA P100 GPUs, Mellanox InfiniBand EDR 100 Gb/s networking) plus custom hardware and an innovative system middleware software. D.A.V.I.D.E. features (i) a dedicated power monitor interface, built around the BeagleBone Black Board, that allows high frequency sampling directly from the power backplane and scalable integration with the internal node telemetry and system level power management software; (ii) a custom-built chassis, based on OpenRack form factor, and liquid cooling that allows the system to be used in modern, energy efficient datacenters; (iii) software components designed for enabling fine grain power monitoring, power management (i.e. power capping and energy aware job scheduling) and application power profiling, based on dedicated machine learning components. Software APIs are offered to developers and users to tune the computing node performance and power consumption around the application requirements. The first pilot system, which we will deploy at the beginning of 2017, will demonstrate key HPC applications from different fields ported and optimized for this innovative platform.

Submitted 11 July, 2023; originally announced July 2023.

arXiv:2303.05378 [pdf, other]

Greener yet Powerful: Taming Large Code Generation Models with Quantization

Authors: Xiaokai Wei, Sujan Gonugondla, Wasi Ahmad, Shiqi Wang, Baishakhi Ray, Haifeng Qian, Xiaopeng Li, Varun Kumar, Zijian Wang, Yuchen Tian, Qing Sun, Ben Athiwaratkun, Mingyue Shang, Murali Krishna Ramanathan, Parminder Bhatia, Bing Xiang

Abstract: ML-powered code generation aims to assist developers to write code in a more productive manner, by intelligently generating code blocks based on natural language prompts. Recently, large pretrained deep learning models have substantially pushed the boundary of code generation and achieved impressive performance. Despite their great power, the huge number of model parameters poses a significant threat to adapting them in a regular software development environment, where a developer might use a standard laptop or mid-size server to develop her code. Such large models incur significant resource usage (in terms of memory, latency, and dollars) as well as carbon footprint. Model compression is a promising approach to address these challenges. Several techniques are proposed to compress large pretrained models typically used for vision or textual data. Out of many available compression techniques, we identified that quantization is mostly applicable for code generation tasks as it does not require significant retraining cost. As quantization represents model parameters with lower-bit integers (e.g., int8), the model size and runtime latency would both benefit from such int representation. We extensively study the impact of quantized models on code generation tasks across different dimensions: (i) resource usage and carbon footprint, (ii) accuracy, and (iii) robustness. To this end, through systematic experiments we find a recipe of quantization technique that could run even a 6B model on a regular laptop without significant accuracy or robustness degradation. We further found the recipe is readily applicable to the code summarization task as well.

Submitted 9 March, 2023; originally announced March 2023.

Comments: 10 pages, 7 figures, 10 tables

arXiv:2212.10233 [pdf, other]

Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study

Authors: Di Wu, Wasi Uddin Ahmad, Kai-Wei Chang

Abstract: Neural models that do not rely on pre-training have excelled in the keyphrase generation task with large annotated datasets. Meanwhile, new approaches have incorporated pre-trained language models (PLMs) for their data efficiency. However, there is no systematic study of how the two types of approaches compare and how different design choices can affect the performance of PLM-based models. To fill in this knowledge gap and facilitate a more informed use of PLMs for keyphrase extraction and keyphrase generation, we present an in-depth empirical study. Formulating keyphrase extraction as sequence labeling and keyphrase generation as sequence-to-sequence generation, we perform extensive experiments in three domains. After showing that PLMs have competitive high-resource performance and state-of-the-art low-resource performance, we investigate important design choices including in-domain PLMs, PLMs with different pre-training objectives, using PLMs with a parameter budget, and different formulations for present keyphrases. Further results show that (1) in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models; (2) with a fixed parameter budget, prioritizing model depth over width and allocating more layers in the encoder leads to better encoder-decoder models; and (3) introducing four in-domain PLMs, we achieve a competitive performance in the news domain and the state-of-the-art performance in the scientific domain.

Submitted 22 February, 2024; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: Technical Report. The contents are published in two separate papers in EMNLP 2023 (arXiv:2310.06374) and LREC-COLING 2024 (arXiv:2402.14052)

arXiv:2212.10011 [pdf, other]

PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English

Authors: Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, Kai-Wei Chang

Abstract: Privacy policies provide individuals with information about their rights and how their personal information is handled. Natural language understanding (NLU) technologies can support individuals and practitioners to better understand privacy practices described in lengthy and complex documents. However, existing efforts that use NLU technologies are limited by processing the language in a way exclusive to a single task focusing on certain privacy practices. To this end, we introduce the Privacy Policy Language Understanding Evaluation (PLUE) benchmark, a multi-task benchmark for evaluating the privacy policy language understanding across various tasks. We also collect a large corpus of privacy policies to enable privacy policy domain-specific language model pre-training. We evaluate several generic pre-trained language models and continue pre-training them on the collected corpus. We demonstrate that domain-specific continual pre-training offers performance improvements across all tasks.

Submitted 12 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: ACL 2023. Code is released at https://github.com/JFChi/PLUE

arXiv:2212.10007 [pdf, other]

CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context

Authors: Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang

Abstract: While pre-trained language models (LM) for code have achieved great success in code completion, they generate code conditioned only on the contents within the file, i.e., in-file context, but ignore the rich semantics in other files within the same project, i.e., cross-file context, a critical source of information that is especially useful in modern modular software development. This oversight constrains code language models' capacity in code completion, leading to unexpected behaviors such as generating hallucinated class member functions or function calls with unexpected arguments. In this work, we develop a cross-file context finder tool, CCFINDER, that effectively locates and retrieves the most relevant cross-file context. We propose CoCoMIC, a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs. CoCoMIC successfully improves the existing code LM with a 33.94% relative increase in exact match and a 28.69% relative increase in identifier matching for code completion when the cross-file context is provided.

Submitted 24 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

arXiv:2211.09392 [pdf, other]

Data Dimension Reduction makes ML Algorithms efficient

Authors: Wisal Khan, Muhammad Turab, Waqas Ahmad, Syed Hasnat Ahmad, Kelash Kumar, Bin Luo

Abstract: Data dimension reduction (DDR) is all about mapping data from high dimensions to low dimensions. Various techniques of DDR are being used for image dimension reduction, like Random Projections, Principal Component Analysis (PCA), the Variance approach, LSA-Transform, the Combined and Direct approaches, and the New Random Approach. Auto-encoders (AE) are used to learn end-to-end mapping. In this paper, we demonstrate that pre-processing not only speeds up the algorithms but also improves accuracy in both supervised and unsupervised learning. In pre-processing of DDR, first PCA based DDR is used for supervised learning, then we explore AE based DDR for unsupervised learning. In PCA based DDR, we first compare supervised learning algorithms' accuracy and time before and after applying PCA. Similarly, in AE based DDR, we compare unsupervised learning algorithm accuracy and time before and after AE representation learning. Supervised learning algorithms, including support-vector machines (SVM), Decision Tree with GINI index, Decision Tree with entropy and Stochastic Gradient Descent classifier (SGDC), and an unsupervised learning algorithm, K-means clustering, are used for classification purposes. We used two datasets, MNIST and FashionMNIST. Our experiment shows that there is massive improvement in accuracy and time reduction after pre-processing in both supervised and unsupervised learning.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Our paper is accepted at International Conference On Emerging Technologies In Electronics, Computing And Communication (ICETECC) 2022

arXiv:2210.14868 [pdf, other]

Multi-lingual Evaluation of Code Generation Models

Authors: Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang

Abstract: We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and we discover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities even in mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represent a significant step towards a deeper understanding of language models' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.

Submitted 28 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: Code and data release: https://github.com/amazon-research/mxeval

arXiv:2210.01185 [pdf, other]

ContraCLM: Contrastive Learning For Causal Language Model

Authors: Nihal Jain, Dejiao Zhang, Wasi Uddin Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang

Abstract: Despite exciting progress in causal language models, the expressiveness of the representations is largely limited due to poor discrimination ability. To remedy this issue, we present ContraCLM, a novel contrastive learning framework at both token-level and sequence-level. We assess ContraCLM on a variety of downstream tasks. We show that ContraCLM enhances discrimination of the representations and bridges the gap with the encoder-only models, which makes causal language models better suited for tasks beyond language generation. Specifically, we attain 44% relative improvement on the Semantic Textual Similarity tasks and 34% on Code-to-Code Search tasks. Furthermore, by improving the expressiveness of the representations, ContraCLM also boosts the source code generation capability with 9% relative improvement on execution accuracy on the HumanEval benchmark.

Submitted 2 May, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: 10 pages

Journal ref: ACL 2023

arXiv:2209.01712 [pdf, other]

ChemBERTa-2: Towards Chemical Foundation Models

Authors: Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar

Abstract: Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with these pretraining improvements, we are competitive with existing state-of-the-art architectures on the MoleculeNet benchmark suite. We analyze the degree to which improvements in pretraining translate to improvement on downstream tasks.

Submitted 4 September, 2022; originally announced September 2022.

Comments: ELLIS Machine Learning for Molecule Discovery Workshop

ACM Class: I.2.7; I.2.1; J.2; J.3

arXiv:2207.09163 [pdf, other]

Urdu Speech and Text Based Sentiment Analyzer

Authors: Waqar Ahmad, Maryam Edalati

Abstract: Discovering what other people think has always been a key aspect of our information-gathering strategy. People can now actively utilize information technology to seek out and comprehend the ideas of others, thanks to the increased availability and popularity of opinion-rich resources such as online review sites and personal blogs. Because of its crucial function in understanding people's opinions, sentiment analysis (SA) is a crucial task. Existing research, on the other hand, is primarily focused on the English language, with just a small amount of study devoted to low-resource languages. For sentiment analysis, this work presents a new multi-class Urdu dataset based on user evaluations. The Urdu dataset was collected from Twitter. Our proposed dataset includes 10,000 reviews that have been carefully classified into two categories by human experts: positive and negative. The primary purpose of this research is to construct a manually annotated dataset for Urdu sentiment analysis and to establish a baseline result. Five different lexicon- and rule-based algorithms, including Naive Bayes, Stanza, TextBlob, VADER, and Flair, are employed, and the experimental results show that Flair, with an accuracy of 70%, outperforms the other tested algorithms.

Submitted 19 July, 2022; originally announced July 2022.

Comments: Sentiment Analysis, Opinion Mining, Urdu language, polarity assessment, lexicon-based method

arXiv:2207.04055 [pdf, other]

Causal Discovery using Model Invariance through Knockoff Interventions

Authors: Wasim Ahmad, Maha Shadaydeh, Joachim Denzler

Abstract: Cause-effect analysis is crucial to understand the underlying mechanism of a system. We propose to exploit model invariance through interventions on the predictors to infer causality in nonlinear multivariate systems of time series. We model nonlinear interactions in time series using DeepAR and then expose the model to different environments using Knockoffs-based interventions to test model invariance. Knockoff samples are pairwise exchangeable, in-distribution and statistically null variables generated without knowing the response. We test model invariance where we show that the distribution of the response residual does not change significantly upon interventions on non-causal predictors. We evaluate our method on real and synthetically generated time series. Overall our method outperforms other widely used causality methods, i.e., VAR Granger causality, VARLiNGAM and PCMCI+.

Submitted 8 July, 2022; originally announced July 2022.

Journal ref: ICML 2022: Workshop on Spurious Correlations, Invariance and Stability

arXiv:2206.07796 [pdf, other]

FixEval: Execution-based Evaluation of Program Fixes for Programming Problems

Authors: Md Mahim Anjum Haque, Wasi Uddin Ahmad, Ismini Lourentzou, Chris Brown

Abstract: The complexity of modern software has led to a drastic increase in the time and cost associated with detecting and rectifying software bugs. In response, researchers have explored various methods to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible fixes for any given bug, few tools and datasets are available to evaluate model-generated fixes effectively. To address this issue, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes. FixEval offers an extensive collection of unit tests to evaluate the correctness of model-generated program fixes and assess further information regarding time, memory constraints, and acceptance based on a verdict. We consider two Transformer language models pretrained on programming languages as our baseline and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately. At the same time, execution-based methods evaluate programs through all cases and scenarios designed explicitly for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced at https://github.com/mahimanzum/FixEval.
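The contrast between match-based and execution-based evaluation described in this abstract can be illustrated with a minimal Python sketch. It is not the FixEval harness: the function names (`exact_match`, `passes_tests`), the toy `add` problem, and the test cases are all hypothetical and chosen only for illustration.

```python
# Hypothetical illustration only; not code from the FixEval repository.

def exact_match(candidate: str, reference: str) -> bool:
    """Match-based metric: compare whitespace-normalized token sequences."""
    return candidate.split() == reference.split()


def passes_tests(source: str, func_name: str, tests) -> bool:
    """Execution-based metric: run the candidate against every unit test."""
    namespace = {}
    try:
        exec(source, namespace)  # define the candidate function
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors count as failures.
        return False


REFERENCE_FIX = "def add(a, b):\n    return a + b\n"
# Textually different from the reference, but functionally correct.
CANDIDATE_OK = "def add(x, y):\n    total = x + y\n    return total\n"
# One token away from the reference, but functionally wrong.
CANDIDATE_BAD = "def add(a, b):\n    return a - b\n"

TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]

for name, cand in [("ok", CANDIDATE_OK), ("bad", CANDIDATE_BAD)]:
    print(name, exact_match(cand, REFERENCE_FIX), passes_tests(cand, "add", TESTS))
```

In this toy case the correct candidate fails the exact-match check, and the candidate that is textually closest to the reference fails the tests, which is the kind of disagreement between the two metric families that the abstract reports. A real harness would also sandbox execution and enforce time and memory limits, which this sketch omits.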

Submitted 30 March, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

arXiv:2205.11116 [pdf, other]

Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages

Authors: Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Abstract: Back-translation is widely known for its effectiveness in neural machine translation when there is little to no parallel data. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The target-to-source model generates noisy sources, while the source-to-target model is trained to reconstruct the targets and vice versa. Recent developments of multilingual pre-trained sequence-to-sequence models for programming languages have been very effective for a broad spectrum of downstream software engineering tasks. Hence, training them to build programming language translation systems via back-translation is compelling. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate natural language (NL) summaries given code snippets. In code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as a target-to-NL-to-source generation. We show that our proposed approach performs competitively with state-of-the-art methods. We have made the code publicly available.

Submitted 11 February, 2023; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Accepted to EACL 2023 (Main)

arXiv:2205.11081 [pdf, other]

BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla

Authors: Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Rifat Shahriyar

Abstract: This work presents BanglaNLG, a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain BanglaT5, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9% absolute gain and 32% relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.

Submitted 11 February, 2023; v1 submitted 23 May, 2022; originally announced May 2022.

Comments: Findings of EACL 2023 (camera-ready)

arXiv:2204.08952 [pdf, other]

Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies

Authors: Md Rizwan Parvez, Jianfeng Chi, Wasi Uddin Ahmad, Yuan Tian, Kai-Wei Chang

Abstract: Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expands the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.

Submitted 22 April, 2023; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: EACL 2023

arXiv:2203.08118 [pdf, other]

Representation Learning for Resource-Constrained Keyphrase Generation

Authors: Di Wu, Wasi Uddin Ahmad, Sunipa Dev, Kai-Wei Chang

Abstract: State-of-the-art keyphrase generation methods generally depend on large annotated datasets, limiting their performance in domains with limited annotated data. To overcome this challenge, we design a data-oriented approach that first identifies salient information using retrieval-based corpus-level statistics, and then learns a task-specific intermediate representation based on a pre-trained language model using large-scale unlabeled documents. We introduce salient span recovery and salient span prediction as denoising training objectives that condense the intra-article and inter-article knowledge essential for keyphrase generation. Through experiments on multiple keyphrase generation benchmarks, we show the effectiveness of the proposed approach for facilitating low-resource keyphrase generation and zero-shot domain adaptation. Our method especially benefits the generation of absent keyphrases, approaching the performance of models trained with large training sets.

Submitted 21 October, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: EMNLP 2022 (Findings)

arXiv:2112.08804 [pdf, other]

CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs

Authors: Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Yuan-Fang Li, Yong-Bin Kang, Rifat Shahriyar

Abstract: We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset and perform a controlled human evaluation to validate its quality. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also introduce LaSE, an embedding-based metric for automatically evaluating model-generated summaries. LaSE is strongly correlated with ROUGE and, unlike ROUGE, can be reliably measured even in the absence of references in the target language. Performance on ROUGE and LaSE indicates that our proposed model consistently outperforms baseline models. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first ever that is not centered around English. We are releasing the dataset, training and evaluation scripts, and models to spur future research on cross-lingual summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum

Submitted 25 May, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: ACL 2023 (camera-ready)

arXiv:2109.10817 [pdf, other]

Causal Inference in Non-linear Time-series using Deep Networks and Knockoff Counterfactuals

Authors: Wasim Ahmad, Maha Shadaydeh, Joachim Denzler

Abstract: Estimating causal relations is vital in understanding the complex interactions in multivariate time series. Non-linear coupling of variables is one of the major challenges in accurate estimation of cause-effect relations. In this paper, we propose to use deep autoregressive networks (DeepAR) in tandem with counterfactual analysis to infer nonlinear causal relations in multivariate time series. We extend the concept of Granger causality using probabilistic forecasting with DeepAR. Since deep networks can neither handle missing input nor out-of-distribution intervention, we propose to use the Knockoffs framework (Barber and Candès, 2015) for generating intervention variables and consequently counterfactual probabilistic forecasting. Knockoff samples are independent of their output given the observed variables and exchangeable with their counterpart variables without changing the underlying distribution of the data. We test our method on synthetic as well as real-world time series datasets. Overall our method outperforms the widely used vector autoregressive Granger causality and PCMCI in detecting nonlinear causal dependency in multivariate time series.

Submitted 18 October, 2021; v1 submitted 22 September, 2021; originally announced September 2021.

Journal ref: IEEE International Conference on Machine Learning and Applications (ICMLA) 2021

arXiv:2108.11601 [pdf, other]

Retrieval Augmented Code Generation and Summarization

Authors: Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Abstract: Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers' code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER is distinctive in two ways. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.

Submitted 10 September, 2021; v1 submitted 26 August, 2021; originally announced August 2021.

Comments: accepted in EMNLP-Findings 2021

arXiv:2108.11590 [pdf, other]

AVATAR: A Parallel Corpus for Java-Python Program Translation

Authors: Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, Kai-Wei Chang

Abstract: Program translation refers to migrating source code from one programming language to another. It has tremendous practical value in software development, as porting software across languages is time-consuming and costly. Automating program translation is of paramount importance in software migration, and recently researchers explored unsupervised approaches due to the unavailability of parallel corpora. However, the availability of pre-trained language models for programming languages enables supervised fine-tuning with a small number of labeled examples. Therefore, we present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python. AVATAR is collected from competitive programming sites, online platforms, and open-source repositories. Furthermore, AVATAR includes unit tests for 250 examples to facilitate functional correctness evaluation. We benchmark several pre-trained language models fine-tuned on AVATAR. Experimental results show that the models struggle to generate functionally accurate code.

Submitted 4 May, 2023; v1 submitted 26 August, 2021; originally announced August 2021.

Comments: Accepted to Findings of ACL 2023

arXiv:2106.02134 [pdf, other]

Syntax-augmented Multilingual BERT for Cross-lingual Transfer

Authors: Wasi Uddin Ahmad, Haoran Li, Kai-Wei Chang, Yashar Mehdad

Abstract: In recent years, we have seen a colossal effort in pre-training multilingual text encoders using large-scale corpora in many languages to facilitate cross-lingual transfer learning. However, due to typological differences across languages, the cross-lingual transfer is challenging. Nevertheless, language syntax, e.g., syntactic dependencies, can bridge the typological gap. Previous works have shown that pre-trained multilingual encoders, such as mBERT (Devlin et al., 2019), capture language syntax, helping cross-lingual transfer. This work shows that explicitly providing language syntax and training mBERT using an auxiliary objective to encode the universal dependency tree structure helps cross-lingual transfer. We perform rigorous experiments on four NLP tasks, including text classification, question answering, named entity recognition, and task-oriented semantic parsing. The experimental results show that syntax-augmented mBERT improves cross-lingual transfer on popular benchmarks, such as PAWS-X and MLQA, by 1.4 and 1.6 points on average across all languages. In the generalized transfer setting, performance improves significantly, by 3.9 and 3.1 points on average on PAWS-X and MLQA.

Submitted 3 June, 2021; originally announced June 2021.

Comments: ACL 2021 (camera ready)

arXiv:2105.14220 [pdf, other]

CoDesc: A Large Code-Description Parallel Dataset

Authors: Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md. Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal, Rifat Shahriyar

Abstract: Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDesc's effectiveness in pre-training--fine-tuning setup, opening possibilities in building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at https://github.com/csebuetnlp/CoDesc.

Submitted 29 May, 2021; originally announced May 2021.

Comments: Findings of the Association for Computational Linguistics, ACL 2021 (camera-ready)

arXiv:2104.08645 [pdf, other]

Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training

Authors: Kuan-Hao Huang, Wasi Uddin Ahmad, Nanyun Peng, Kai-Wei Chang

Abstract: Pre-trained multilingual language encoders, such as multilingual BERT and XLM-R, show great potential for zero-shot cross-lingual transfer. However, these multilingual encoders do not precisely align words and phrases across languages. In particular, learning alignments in the multilingual embedding space usually requires sentence-level or word-level parallel corpora, which are expensive to obtain for low-resource languages. An alternative is to make the multilingual encoders more robust; when fine-tuning the encoder on a downstream task, we train the encoder to tolerate noise in the contextual embedding spaces such that even if the representations of different languages are not aligned well, the model can still achieve good performance on zero-shot cross-lingual transfer. In this work, we propose a learning strategy for training robust models by drawing connections between adversarial examples and the failure cases of zero-shot cross-lingual transfer. We adopt two widely used robust training methods, adversarial training and randomized smoothing, to train the desired robust model. The experimental results demonstrate that robust training improves zero-shot cross-lingual transfer on text classification tasks. The improvement is more significant in the generalized cross-lingual transfer setting, where the pair of input sentences belong to two different languages.

Submitted 10 September, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

Comments: EMNLP 2021

arXiv:2104.08301 [pdf, other]

Text2App: A Framework for Creating Android Apps from Text Descriptions

Authors: Masum Hasan, Kazi Sajeed Mehrab, Wasi Uddin Ahmad, Rifat Shahriyar

Abstract: We present Text2App, a framework that allows users to create functional Android applications from natural language specifications. The conventional method of source code generation tries to generate source code directly, which is impractical for creating complex software. We overcome this limitation by transforming natural language into an abstract intermediate formal language representing an application with a substantially smaller number of tokens. The intermediate formal representation is then compiled into target source code. This abstraction of programming details allows seq2seq networks to learn complex application structures with less overhead. In order to train sequence models, we introduce a data synthesis method grounded in a human survey. We demonstrate that Text2App generalizes well to unseen combinations of app components and is capable of handling noisy natural language instructions. We explore the possibility of creating applications from highly abstract instructions by coupling our system with GPT-3, a large pretrained language model. We perform an extensive human evaluation and identify the capabilities and limitations of our system. The source code, a ready-to-run demo notebook, and a demo video are publicly available at https://github.com/text2app/Text2App.

Submitted 7 July, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Submitted to EMNLP 2021 System Demonstrations

arXiv:2103.06333 [pdf, other]

Unified Pre-training for Program Understanding and Generation

Authors: Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Abstract: Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART's effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), and logical flow (e.g., an if block inside an else block is equivalent to an else if block), which are crucial to program semantics, and thus excels even with limited annotations.

Submitted 10 April, 2021; v1 submitted 10 March, 2021; originally announced March 2021.

Comments: NAACL 2021 (camera ready)

arXiv:2101.00204 [pdf, other]

BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla

Authors: Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Kazi Samin, Md Saiful Islam, Anindya Iqbal, M. Sohel Rahman, Rifat Shahriyar

Abstract: In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results, outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.

Submitted 10 May, 2022; v1 submitted 1 January, 2021; originally announced January 2021.

Comments: Findings of North American Chapter of the Association for Computational Linguistics, NAACL 2022 (camera-ready)

arXiv:2101.00123 [pdf, other]

Intent Classification and Slot Filling for Privacy Policies

Authors: Wasi Uddin Ahmad, Jianfeng Chi, Tu Le, Thomas Norton, Yuan Tian, Kai-Wei Chang

Abstract: Understanding privacy policies is crucial for users as it empowers them to learn about the information that matters to them. Sentences written in a privacy policy document explain privacy practices, and the constituent text spans convey further specific information about that practice. We refer to predicting the privacy practice explained in a sentence as intent classification and identifying the text spans sharing specific information as slot filling. In this work, we propose PolicyIE, an English corpus consisting of 5,250 intent and 11,788 slot annotations spanning 31 privacy policies of websites and mobile applications. The PolicyIE corpus is a challenging real-world benchmark with limited labeled examples, reflecting the cost of collecting large-scale annotations from domain experts. We present two alternative neural approaches as baselines: (1) intent classification and slot filling as joint sequence tagging and (2) modeling them as a sequence-to-sequence (Seq2Seq) learning task. The experimental results show that both approaches perform comparably in intent classification, while the Seq2Seq method outperforms the sequence tagging approach in slot filling by a large margin. We perform a detailed error analysis to reveal the challenges of the proposed corpus.

Submitted 4 June, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

Comments: ACL 2021 (camera ready)

arXiv:2012.07701 [pdf, other]

Simple or Complex? Learning to Predict Readability of Bengali Texts

Authors: Susmoy Chakraborty, Mir Tafseer Nayeem, Wasi Uddin Ahmad

Abstract: Determining the readability of a text is the first step to its simplification. In this paper, we present a readability analysis tool capable of analyzing text written in the Bengali language to provide in-depth information on its readability and complexity. Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing. Readability-related research on the Bengali language so far can be considered narrow and sometimes faulty due to the lack of resources. Therefore, we correctly adapt document-level readability formulas traditionally used for the U.S.-based education system to the Bengali language with a proper age-to-age comparison. Due to the unavailability of large-scale human-annotated corpora, we further divide the document-level task into sentence-level and experiment with neural architectures, which will serve as a baseline for future work on Bengali readability prediction. During the process, we present several human-annotated corpora and dictionaries such as a document-level dataset comprising 618 documents with 12 different grade levels, a large-scale sentence-level dataset comprising more than 96K sentences with simple and complex labels, a consonant conjunct count algorithm and a corpus of 341 words to validate the effectiveness of the algorithm, a list of 3,396 easy words, and an updated pronunciation dictionary with more than 67K words. These resources can be useful for several other tasks of this low-resource language.
We make our Code & Dataset publicly available at https://github.com/tafseer-nayeem/BengaliReadability for reproducibility.

Submitted 8 December, 2020; originally announced December 2020.

Comments: Accepted for publication at AAAI 2021

arXiv:2010.03009 [pdf, other]

GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction

Authors: Wasi Uddin Ahmad, Nanyun Peng, Kai-Wei Chang

Abstract: Recent progress in cross-lingual relation and event extraction uses graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations such that models trained on one language can be applied to other languages. However, GCNs struggle to model words that have long-range dependencies or are not directly connected in the dependency tree. To address these challenges, we propose to utilize the self-attention mechanism where we explicitly fuse structural information to learn the dependencies between words with different syntactic distances. We introduce GATE, a Graph Attention Transformer Encoder, and test its cross-lingual transferability on relation and event extraction tasks. We perform experiments on the ACE05 dataset that includes three typologically different languages: English, Chinese, and Arabic. The evaluation results show that GATE outperforms three recently proposed methods by a large margin. Our detailed analysis reveals that due to the reliance on syntactic dependencies, GATE produces robust representations that facilitate transfer across languages.

Submitted 17 February, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: AAAI 2021

arXiv:2010.02557 [pdf, other]

PolicyQA: A Reading Comprehension Dataset for Privacy Policies

Authors: Wasi Uddin Ahmad, Jianfeng Chi, Yuan Tian, Kai-Wei Chang

Abstract: Privacy policy documents are long and verbose. A question answering (QA) system can assist users in finding the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences from the policy document given a question. In contrast, we argue that providing users with a short text span from policy documents reduces the burden of searching for the target information in a lengthy text segment. In this paper, we present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices. We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.

Submitted 6 October, 2020; originally announced October 2020.

Comments: EMNLP Findings 2020 (short paper)

arXiv:2008.01739 [pdf, other]

Select, Extract and Generate: Neural Keyphrase Generation with Layer-wise Coverage Attention

Authors: Wasi Uddin Ahmad, Xiao Bai, Soomin Lee, Kai-Wei Chang

Abstract: Natural language processing techniques have demonstrated promising results in keyphrase generation. However, one of the major challenges in neural keyphrase generation is processing long documents using deep neural networks. Generally, documents are truncated before being given as inputs to neural networks. Consequently, the models may miss essential points conveyed in the target document. To overcome this limitation, we propose SEG-Net, a neural keyphrase generation model that is composed of two major components: (1) a selector that selects the salient sentences in a document and (2) an extractor-generator that jointly extracts and generates keyphrases from the selected sentences. SEG-Net uses Transformer, a self-attentive architecture, as the basic building block with a novel layer-wise coverage attention to summarize most of the points discussed in the document. The experimental results on seven keyphrase generation benchmarks from scientific and web documents demonstrate that SEG-Net outperforms the state-of-the-art neural generative methods by a large margin.

Submitted 4 June, 2021; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: ACL 2021 (camera ready)

arXiv:2005.00653 [pdf, ps, other]

A Transformer-based Approach for Source Code Summarization

Authors: Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Abstract: Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model, which uses a self-attention mechanism and has been shown to be effective in capturing long-range dependencies. In this work, we show that although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens' position hinders, while relative encoding significantly improves, the summarization performance. We have made our code publicly available to facilitate future research.

Submitted 1 May, 2020; originally announced May 2020.

Comments: This paper is accepted at ACL2020

arXiv:2004.14384 [pdf, other]

A Formally Verified HOL4 Algebra for Event Trees

Authors: Mohamed Abdelghany, Waqar Ahmad, Sofiene Tahar

Abstract: Event Tree (ET) analysis is widely used as a forward deductive safety analysis technique for decision-making at the critical-system design stage. ET is a schematic diagram representing all possible operating states and external events in a system so that one of these possible scenarios can occur. In this report, we propose to use the HOL4 theorem prover for the formal modeling and step-analysis of ET diagrams. To this end, we developed a formalization of ETs in higher-order logic, which is based on a generic list datatype that can: (i) construct an arbitrary level of ET diagrams; (ii) reduce the irrelevant ET branches; (iii) partition ET paths; and (iv) perform the probabilistic analysis based on the occurrence of certain events. For illustration purposes, we conduct the formal ET stepwise analysis of an electrical power grid and also determine its System Average Interruption Frequency Index (SAIFI), which is an important indicator for system reliability.

Submitted 29 April, 2020; originally announced April 2020.

Comments: 17 pages, 3 figures

arXiv:2003.06327 [pdf]

doi 10.1109/ICET48972.2019.8994412

Human Activity Recognition using Multi-Head CNN followed by LSTM

Authors: Waqar Ahmad, Misbah Kazmi, Hazrat Ali

Abstract: This study presents a novel method to recognize human physical activities using a CNN followed by an LSTM. Achieving high accuracy with traditional machine learning algorithms (such as SVM, KNN, and random forest) is a challenging task because the data acquired from wearable sensors like accelerometers and gyroscopes is time-series data. So, to achieve high accuracy, we propose a multi-head CNN model comprising three CNNs to extract features for the data acquired from different sensors; all three CNNs are then merged and followed by an LSTM layer and a dense layer. The configuration of all three CNNs is kept the same so that the same number of features is obtained for every input to a CNN. By using the proposed method, we achieve state-of-the-art accuracy, which is comparable to traditional machine learning algorithms and other deep neural network algorithms.

Submitted 21 February, 2020; originally announced March 2020.

Comments: IEEE ICET 2019

arXiv:1909.09265 [pdf, other]

Cross-lingual Dependency Parsing with Unlabeled Auxiliary Languages

Authors: Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Kai-Wei Chang, Nanyun Peng

Abstract: Cross-lingual transfer learning has become an important weapon to battle the unavailability of annotated resources for low-resource languages. One of the fundamental techniques to transfer across languages is learning language-agnostic representations, in the form of word embeddings or contextual encodings. In this work, we propose to leverage unannotated sentences from auxiliary languages to help learn language-agnostic representations. Specifically, we explore adversarial training for learning contextual encoders that produce invariant representations across languages to facilitate cross-lingual transfer. We conduct experiments on cross-lingual dependency parsing where we train a dependency parser on a source language and transfer it to a wide range of target languages. Experiments on 28 target languages demonstrate that adversarial training significantly improves the overall transfer performance under several different settings. We conduct a careful analysis to evaluate the language-agnostic representations resulting from adversarial training.

Submitted 19 September, 2019; originally announced September 2019.

Comments: CoNLL 2019

arXiv:1906.02329 [pdf, other]

Context Attentive Document Ranking and Query Suggestion

Authors: Wasi Uddin Ahmad, Kai-Wei Chang, Hongning Wang

Abstract: We present a context-aware neural ranking model to exploit users' on-task search activities and enhance retrieval performance. In particular, a two-level hierarchical recurrent neural network is introduced to learn search context representation of individual queries, search tasks, and corresponding dependency structure by jointly optimizing two companion retrieval tasks: document ranking and query suggestion. To identify the variable dependency structure between search context and users' ongoing search activities, attention at both levels of recurrent states is introduced. Extensive experimental comparisons against a rich set of baseline methods and an in-depth ablation analysis confirm the value of our proposed approach for modeling search context buried in search tasks.

Submitted 5 June, 2019; originally announced June 2019.

Comments: Accepted to SIGIR 2019

arXiv:1901.10702 [pdf, ps, other]

Lower Bound on the Sum-rate of Decremental Beam Selection Algorithm for Beamspace MIMO Systems

Authors: Naveed Iqbal, Waqas Ahmad, Christian Schneider, Reiner S. Thomä

Abstract: In general, zero-forcing (ZF) precoding suffers from a severe receive signal-to-noise ratio (SNR) degradation in the high interference regime. However, recent evidence from realistic measurements has demonstrated that millimeter wave (mmWave) systems are mainly noise-limited, as high gain antennas behave as spatial filters to the interference signal. This makes ZF precoding as attractive as other linear precoding counterparts. Considering ZF precoding, this paper aims to derive a lower bound on the sum-rate achieved by a decremental beam selection (BS) algorithm in a beamspace MIMO (B-MIMO) system operating at mmWave frequencies. This bound relates Frobenius norms of precoding matrices of full and reduced dimensional (i.e., after BS) B-MIMO systems through a deterministic square-hyperbolic function. Note that both ZF precoding and decremental BS are not new concepts. However, the derived sum-rate bound provides a new insight into the topic. Given a particular full dimensional B-MIMO channel, the presented bound can be used to understand limits of BS algorithms.

Submitted 30 January, 2019; originally announced January 2019.

arXiv:1811.00570 [pdf, other]

On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing

Authors: Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, Nanyun Peng

Abstract: Different languages might have different word orders. In this paper, we investigate cross-lingual transfer and posit that an order-agnostic model will perform better when transferring to distant foreign languages. To test our hypothesis, we train dependency parsers on an English corpus and evaluate their transfer performance on 30 other languages. Specifically, we compare encoders and decoders based on Recurrent Neural Networks (RNNs) and modified self-attentive architectures. The former relies on sequential information while the latter is more flexible at modeling word order. Rigorous experiments and detailed analysis show that RNN-based architectures transfer well to languages that are close to English, while self-attentive models have better overall cross-lingual transferability and perform especially well on distant languages.

Submitted 16 April, 2019; v1 submitted 1 November, 2018; originally announced November 2018.

Comments: Accepted by NAACL-2019

arXiv:1810.00681 [pdf, other]

Learning Robust, Transferable Sentence Representations for Text Classification

Authors: Wasi Uddin Ahmad, Xueying Bai, Nanyun Peng, Kai-Wei Chang

Abstract: Although deep recurrent neural networks (RNNs) demonstrate strong performance in text classification, training RNN models is often expensive and requires an extensive collection of annotated data, which may not be available. To overcome the data limitation issue, existing approaches leverage either pre-trained word embeddings or sentence representations to lift the burden of training RNNs from scratch. In this paper, we show that jointly learning sentence representations from multiple text classification tasks and combining them with pre-trained word-level and sentence-level encoders result in robust sentence representations that are useful for transfer learning. Extensive experiments and analyses using a wide range of transfer and linguistic tasks endorse the effectiveness of our approach.

Submitted 28 September, 2018; originally announced October 2018.

Comments: arXiv admin note: substantial text overlap with arXiv:1804.07911

arXiv:1809.02887 [pdf, ps, other]

doi 10.1109/MESA.2018.8449144

Accelerating Viterbi Algorithm using Custom Instruction Approach

Authors: Waqar Ahmad, Imran Hafeez Abbassi, Usman Sanwal, Hasan Mahmood

Abstract: In recent years, decoding algorithms in communication networks have become increasingly complex, aiming to achieve high reliability in correctly decoding received messages. These decoding algorithms involve computationally complex operations requiring high performance computing hardware, which is generally expensive. A cost-effective solution is to enhance the Instruction Set Architecture (ISA) of the processors by creating new custom instructions for the computational parts of the decoding algorithms. In this paper, we propose to utilize the custom instruction approach to efficiently implement the widely used Viterbi decoding algorithm by adding assembly language instructions to the ISA of DLX, PicoJava II and NIOS II processors, which represent RISC, stack and FPGA-based soft-core processor architectures, respectively. By using the custom instruction approach, the execution time of the Viterbi algorithm is significantly improved by approximately 3 times for DLX and PicoJava II, and by 2 times for NIOS II.

Submitted 8 September, 2018; originally announced September 2018.

Comments: 7 Pages, 4 Figures, 2018 14th IEEE/ASME International Conference on Mechatronic and Embedded Systems and Applications (MESA)

Showing 1–50 of 57 results for author: Ahmad, W