Search | arXiv e-print repository

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

Authors: Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari

Abstract: Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but also increase end-to-end latency per inference run, requiring high speculation acceptance rates to improve performance. Combined with a variable rate of acceptance across tasks, speculative inference techniques can result in reduced performance. Additionally, pipeline-parallel designs require many user requests to maintain maximum utilization. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15× improvement in generation speed over standard speculative inference. PipeInfer achieves its improvement through Continuous Asynchronous Speculation and Early Inference Cancellation, the former improving latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs, even in the middle of inference.

Submitted 16 July, 2024; originally announced July 2024.

Comments: 11 pages, submitted to SC24 conference
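
The dependence on acceptance rate described in the abstract above can be illustrated with a minimal sketch of generic greedy verification in speculative decoding. This is not PipeInfer's pipelined scheme; the function name `accept_prefix` and the token lists are hypothetical, and the sketch assumes the target model's tokens for the drafted positions are already available.

```python
def accept_prefix(draft_tokens, target_tokens):
    """Return the tokens kept after greedy verification of a speculative draft.

    draft_tokens:  tokens proposed by the small speculative model (assumed input).
    target_tokens: tokens the large model emits at the same positions, optionally
                   with one extra token for the position after the draft.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First mismatch: keep the target's token and discard the rest of the draft.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)
    # Every drafted token matched, so the target's extra token (if any) is kept too.
    if len(target_tokens) > len(draft_tokens):
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

When the draft rarely matches, each verification run yields only a single token, which is the low-acceptance case the abstract says can reduce performance.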

arXiv:2407.07321 [pdf, other]

RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension

Authors: Hung Phan, Anurag Acharya, Sarthak Chaturvedi, Shivam Sharma, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana

Abstract: Large Language Models (LLMs) have been applied to many research problems across various domains. One of the applications of LLMs is providing question-answering systems that cater to users from different fields. The effectiveness of LLM-based question-answering systems has already been established at an acceptable level for users posing questions in popular and public domains such as trivia and literature. However, it has not often been established in niche domains that traditionally require specialized expertise. To this end, we construct the NEPAQuAD1.0 benchmark to evaluate the performance of three frontier LLMs -- Claude Sonnet, Gemini, and GPT-4 -- when answering questions originating from Environmental Impact Statements prepared by U.S. federal government agencies in accordance with the National Environmental Policy Act (NEPA). We specifically measure the ability of LLMs to understand the nuances of legal, technical, and compliance-related information present in NEPA documents in different contextual scenarios. For example, we test the LLMs' internal prior NEPA knowledge by providing questions without any context, as well as assess how LLMs synthesize the contextual information present in long NEPA documents to facilitate the question/answering task. We compare the performance of the long context LLMs and RAG powered models in handling different types of questions (e.g., problem-solving, divergent).
Our results suggest that RAG powered models significantly outperform the long context models in answer accuracy regardless of the choice of the frontier LLM. Our further analysis reveals that many models perform better answering closed questions than divergent and problem-solving questions.

Submitted 9 July, 2024; originally announced July 2024.

Comments: 14 pages

arXiv:2407.02238 [pdf, other]

MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations

Authors: Akash Dutta, Ali Jannesari

Abstract: One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. Especially approaches mimicking large-language models (LLMs) have been proposed. But these have prohibitively large training costs. In this paper, we propose MIREncoder, a Multi-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs. It allows us to better model code syntax, semantics and structure. For code-based performance optimizations, these features are very important while making optimization decisions. A pre-trained model/embedding implicitly enables the usage of transfer learning, and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead, and be easily usable.
These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations will show that our proposed approach can outperform the state of the art while reducing overhead.

Submitted 2 July, 2024; originally announced July 2024.

Comments: 12 pages, 6 figures, 9 tables, PACT '24 conference

arXiv:2406.13881 [pdf, other]

Static Generation of Efficient OpenMP Offload Data Mappings

Authors: Luke Marzen, Akash Dutta, Ali Jannesari

Abstract: Increasing heterogeneity in HPC architectures and compiler advancements have led to OpenMP being frequently used to enable computations on heterogeneous devices. However, the efficient movement of data on heterogeneous computing platforms is crucial for achieving high utilization. Programmers must explicitly map data between the host and connected accelerator devices to achieve efficient data movement. Ensuring efficient data transfer requires programmers to reason about complex data flow. This can be a laborious and error-prone process since the programmer must keep a mental model of data validity and lifetime spanning multiple data environments. We present a static analysis tool, OMPDart (OpenMP Data Reduction Tool), for OpenMP programs that models data dependencies between host and device regions and applies source code transformations to achieve efficient data transfer. Our evaluations on nine HPC benchmarks demonstrate that OMPDart is capable of generating effective data mapping constructs that substantially reduce data transfer between host and device.

Submitted 26 August, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted to the 2024 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24)

arXiv:2404.15182 [pdf, other]

FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning

Authors: Duy Phuong Nguyen, J. Pablo Munoz, Ali Jannesari

Abstract: In the rapidly evolving field of artificial intelligence, multimodal models, e.g., integrating vision and language into visual-language models (VLMs), have become pivotal for many applications, ranging from image captioning to multimodal search engines. Among these models, the Contrastive Language-Image Pre-training (CLIP) model has demonstrated remarkable performance in understanding and generating nuanced relationships between text and images. However, the conventional training of such models often requires centralized aggregation of vast datasets, posing significant privacy and data governance challenges. To address these concerns, this paper proposes a novel approach that leverages Federated Learning and parameter-efficient adapters, i.e., Low-Rank Adaptation (LoRA), to train VLMs. This methodology preserves data privacy by training models across decentralized data sources and ensures model adaptability and efficiency through LoRA's parameter-efficient fine-tuning. Our approach accelerates training time by up to 34.72 times and requires 2.47 times less memory usage than full fine-tuning.

Submitted 11 April, 2024; originally announced April 2024.

Comments: 10 pages, 11 figures
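
As a rough illustration of why Low-Rank Adaptation is parameter-efficient, the sketch below counts trainable parameters for one weight matrix under full fine-tuning versus a rank-r LoRA update of the form W + BA. The layer size and rank are assumed values chosen for illustration, not figures from the paper, and both function names are hypothetical.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed example: a 768 x 768 projection adapted with rank 8.
full = full_params(768, 768)
lora = lora_trainable_params(768, 768, 8)
reduction = full / lora  # how many times fewer parameters are trained
```

In a federated setting only the small factors need to be trained and exchanged, which is the general motivation for pairing LoRA with Federated Learning.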

arXiv:2404.06638 [pdf, other]

SAM-I-Am: Semantic Boosting for Zero-shot Atomic-Scale Electron Micrograph Segmentation

Authors: Waqwoya Abebe, Jan Strube, Luanzheng Guo, Nathan R. Tallent, Oceane Bel, Steven Spurgeon, Christina Doty, Ali Jannesari

Abstract: Image segmentation is a critical enabler for tasks ranging from medical diagnostics to autonomous driving. However, the correct segmentation semantics - where are boundaries located? what segments are logically similar? - change depending on the domain, such that state-of-the-art foundation models can generate meaningless and incorrect results. Moreover, in certain domains, fine-tuning and retraining techniques are infeasible: obtaining labels is costly and time-consuming; domain images (micrographs) can be exponentially diverse; and data sharing (for third-party retraining) is restricted. To enable rapid adaptation of the best segmentation technology, we propose the concept of semantic boosting: given a zero-shot foundation model, guide its segmentation and adjust results to match domain expectations. We apply semantic boosting to the Segment Anything Model (SAM) to obtain microstructure segmentation for transmission electron microscopy. Our booster, SAM-I-Am, extracts geometric and textural features of various intermediate masks to perform mask removal and mask merging operations. We demonstrate a zero-shot performance increase of (absolute) +21.35%, +12.6%, +5.27% in mean IoU, and a -9.91%, -18.42%, -4.06% drop in mean false positive masks across images of three difficulty classes over vanilla SAM (ViT-L).

Submitted 10 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.
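
The mean IoU metric reported in the abstract above can be sketched as follows. This is a minimal illustration that assumes binary masks stored as lists of 0/1 rows and a one-to-one pairing of predicted and reference masks; the paper's actual mask-matching protocol is not stated in the abstract, and the function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (lists of 0/1 rows)."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(predicted, reference):
    """Average IoU over paired predicted and reference masks."""
    pairs = list(zip(predicted, reference))
    return sum(iou(p, r) for p, r in pairs) / len(pairs)
```

An absolute gain such as +21.35% would then correspond to the difference between two such averages, expressed in percentage points.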

arXiv:2403.07231 [pdf, other]

Learn and Search: An Elegant Technique for Object Lookup using Contrastive Learning

Authors: Chandan Kumar, Jansel Herrera-Gerena, John Just, Matthew Darr, Ali Jannesari

Abstract: The rapid proliferation of digital content and the ever-growing need for precise object recognition and segmentation have driven the advancement of cutting-edge techniques in the field of object classification and segmentation. This paper introduces "Learn and Search", a novel approach for object lookup that leverages the power of contrastive learning to enhance the efficiency and effectiveness of retrieval systems. In this study, we present an elegant and innovative methodology that integrates deep learning principles and contrastive learning to tackle the challenges of object search. Our extensive experimentation reveals compelling results, with "Learn and Search" achieving superior Similarity Grid Accuracy, showcasing its efficacy in discerning regions of utmost similarity within an image relative to a cropped image. The seamless fusion of deep learning and contrastive learning to address the intricacies of object identification not only promises transformative applications in image recognition, recommendation systems, and content tagging but also revolutionizes content-based search and retrieval. The amalgamation of these techniques, as exemplified by "Learn and Search," represents a significant stride in the ongoing evolution of methodologies in the dynamic realm of object classification and segmentation.

Submitted 11 March, 2024; originally announced March 2024.

Comments: 9 pages, 4 figures

arXiv:2403.02518 [pdf, other]

MPI Errors Detection using GNN Embedding and Vector Embedding over LLVM IR

Authors: Jad El Karchi, Hanze Chen, Ali TehraniJamsaz, Ali Jannesari, Mihail Popov, Emmanuelle Saillard

Abstract: Identifying errors in parallel MPI programs is a challenging task. Despite the growing number of verification tools, debugging parallel programs remains a significant challenge. This paper is the first to utilize embedding and deep learning graph neural networks (GNNs) to tackle the issue of identifying bugs in MPI programs. Specifically, we have designed and developed two models that can determine, from a code's LLVM Intermediate Representation (IR), whether the code is correct or contains a known MPI error. We tested our models using two dedicated MPI benchmark suites for verification: MBI and MPI-CorrBench. By training and validating our models on the same benchmark suite, we achieved a prediction accuracy of 92% in detecting error types. Additionally, we trained and evaluated our models on distinct benchmark suites (e.g., transitioning from MBI to MPI-CorrBench) and achieved a promising accuracy of over 80%. Finally, we investigated the interaction between different MPI errors and quantified our models' generalization capabilities over new unseen errors. This involved removing error types during training and assessing whether our models could still predict them. The detection accuracy of removed errors varies significantly, between 20% and 80%, indicating connected error patterns.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.13465 [pdf, other]

Unsupervised learning based object detection using Contrastive Learning

Authors: Chandan Kumar, Jansel Herrera-Gerena, John Just, Matthew Darr, Ali Jannesari

Abstract: Training image-based object detectors presents formidable challenges, as it entails not only the complexities of object detection but also the added intricacies of precisely localizing objects within potentially diverse and noisy environments. However, the collection of imagery itself can often be straightforward; for instance, cameras mounted in vehicles can effortlessly capture vast amounts of data in various real-world scenarios. In light of this, we introduce a groundbreaking method for training single-stage object detectors through unsupervised/self-supervised learning. Our state-of-the-art approach has the potential to revolutionize the labeling process, substantially reducing the time and cost associated with manual annotation. Furthermore, it paves the way for previously unattainable research opportunities, particularly for large, diverse, and challenging datasets lacking extensive labels. In contrast to prevalent unsupervised learning methods that primarily target classification tasks, our approach takes on the unique challenge of object detection. We pioneer the concept of intra-image contrastive learning alongside inter-image counterparts, enabling the acquisition of crucial location information essential for object detection. The method adeptly learns and represents this location information, yielding informative heatmaps.
Our results showcase an outstanding accuracy of 89.2%, marking a significant breakthrough of approximately 15x over random initialization in the realm of unsupervised object detection within the field of computer vision.

Submitted 20 February, 2024; originally announced February 2024.

Comments: 10 pages, 5 figures

arXiv:2402.02018 [pdf, other]

The Landscape and Challenges of HPC Research and LLMs

Authors: Le Chen, Nesreen K. Ahmed, Akash Dutta, Arijit Bhattacharjee, Sixing Yu, Quazi Ishtiaque Mahmud, Waqwoya Abebe, Hung Phan, Aishwarya Sarkar, Branden Butler, Niranjan Hasabnis, Gal Oren, Vy A. Vo, Juan Pablo Munoz, Theodore L. Willke, Tim Mattson, Ali Jannesari

Abstract: Recently, language models (LMs), especially large language models (LLMs), have revolutionized the field of deep learning. Both encoder-decoder models and prompt-based techniques have shown immense potential for natural language processing and code-based tasks. Over the past several years, many research labs and institutions have invested heavily in high-performance computing, approaching or breaching exascale performance levels. In this paper, we posit that adapting and utilizing such language model-based techniques for tasks in high-performance computing (HPC) would be very beneficial. This study presents our reasoning behind the aforementioned position and highlights how existing ideas can be improved and adapted for HPC tasks.

Submitted 6 February, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.16445 [pdf, other]

OMPGPT: A Generative Pre-trained Transformer Model for OpenMP

Authors: Le Chen, Arijit Bhattacharjee, Nesreen Ahmed, Niranjan Hasabnis, Gal Oren, Vy Vo, Ali Jannesari

Abstract: Large language models (LLMs) such as ChatGPT have significantly advanced the field of Natural Language Processing (NLP). This trend led to the development of code-based large language models such as StarCoder, WizardCoder, and CodeLlama, which are trained extensively on vast repositories of code and programming languages. While the generic abilities of these code LLMs are useful for many programmers in tasks like code generation, the area of high-performance computing (HPC) has a narrower set of requirements that make a smaller and more domain-specific model a smarter choice. This paper presents OMPGPT, a novel domain-specific model meticulously designed to harness the inherent strengths of language models for OpenMP pragma generation. Furthermore, we leverage prompt engineering techniques from the NLP domain to create Chain-of-OMP, an innovative strategy designed to enhance OMPGPT's effectiveness. Our extensive evaluations demonstrate that OMPGPT outperforms existing large language models specialized in OpenMP tasks and maintains a notably smaller size, aligning it more closely with the typical hardware constraints of HPC environments. We consider our contribution as a pivotal bridge, connecting the advantage of language models with the specific demands of HPC tasks.

Submitted 21 June, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

arXiv:2312.17430 [pdf, other]

LEFL: Low Entropy Client Sampling in Federated Learning

Authors: Waqwoya Abebe, Pablo Munoz, Ali Jannesari

Abstract: Federated learning (FL) is a machine learning paradigm where multiple clients collaborate to optimize a single global model using their private data. The global model is maintained by a central server that orchestrates the FL training process through a series of training rounds. In each round, the server samples clients from a client pool before sending them its latest global model parameters for further optimization. Naive sampling strategies implement random client sampling and fail to factor in client data distributions for privacy reasons. Hence we propose LEFL, an alternative sampling strategy that performs a one-time clustering of clients based on their model's learned high-level features while respecting data privacy. This enables the server to perform stratified client sampling across clusters in every round. We show that the datasets of clients selected with this approach yield a low relative entropy with respect to the global data distribution. Consequently, the FL training becomes less noisy and significantly improves the convergence of the global model by as much as 7.4% in some experiments. Furthermore, it also significantly reduces the communication rounds required to achieve a target accuracy.

Submitted 13 February, 2024; v1 submitted 28 December, 2023; originally announced December 2023.
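
The two ideas named in the abstract above, stratified client sampling across clusters and relative entropy against the global distribution, can be sketched as follows. This is a minimal illustration, not the LEFL implementation: it assumes the cluster assignments already exist (the paper derives them from learned model features), that label distributions are given as probability lists, and that the reference distribution `q` is non-zero wherever `p` is. All names are hypothetical.

```python
import math
import random


def stratified_sample(clusters, per_cluster, rng):
    """Pick up to `per_cluster` clients from every cluster.

    clusters: dict mapping a cluster id to a list of client ids (assumed given).
    rng:      a random.Random instance, passed in so runs are reproducible.
    """
    chosen = []
    for members in clusters.values():
        k = min(per_cluster, len(members))
        chosen.extend(rng.sample(members, k))
    return chosen


def relative_entropy(p, q):
    """KL divergence D(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A lower `relative_entropy` between the sampled clients' pooled label distribution and the global one indicates that a round's sample is more representative.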

arXiv:2311.06505 [pdf, other]

CompCodeVet: A Compiler-guided Validation and Enhancement Approach for Code Dataset

Authors: Le Chen, Arijit Bhattacharjee, Nesreen K. Ahmed, Niranjan Hasabnis, Gal Oren, Bin Lei, Ali Jannesari

Abstract: Large language models (LLMs) have become increasingly prominent in academia and industry due to their remarkable performance in diverse applications. As these models evolve with increasing parameters, they excel in tasks like sentiment analysis and machine translation. However, even models with billions of parameters face challenges in tasks demanding multi-step reasoning. Code generation and comprehension, especially in C and C++, emerge as significant challenges. While LLMs trained on code datasets demonstrate competence in many tasks, they struggle with rectifying non-compilable C and C++ code. Our investigation attributes this subpar performance to two primary factors: the quality of the training dataset and the inherent complexity of the problem which demands intricate reasoning. Existing "Chain of Thought" (CoT) prompting techniques aim to enhance multi-step reasoning. This approach, however, retains the limitations associated with the latent drawbacks of LLMs. In this work, we propose CompCodeVet, a compiler-guided CoT approach to produce compilable code from non-compilable ones. Diverging from the conventional approach of utilizing larger LLMs, we employ compilers as a teacher to establish a more robust zero-shot thought process. The evaluation of CompCodeVet on two open-source code datasets shows that CompCodeVet has the ability to improve the training dataset quality for LLMs.

Submitted 11 November, 2023; originally announced November 2023.

arXiv:2310.04047 [pdf, other]

AUTOPARLLM: GNN-Guided Automatic Code Parallelization using Large Language Models

Authors: Quazi Ishtiaque Mahmud, Ali TehraniJamsaz, Hung D Phan, Nesreen K. Ahmed, Ali Jannesari

Abstract: Parallelizing sequentially written programs is a challenging task. Even experienced developers need to spend considerable time finding parallelism opportunities and then actually writing parallel versions of sequentially written programs. To address this issue, we present AUTOPARLLM, a framework for automatically discovering parallelism and generating the parallel version of the sequentially written program. Our framework consists of two major components: i) a heterogeneous Graph Neural Network (GNN) based parallelism discovery and parallel pattern detection module, and ii) an LLM-based code generator to generate the parallel counterpart of the sequential programs. We use the GNN to learn the flow-aware characteristics of the programs to identify parallel regions in sequential programs and then construct an enhanced prompt using the GNN's results for the LLM-based generator to finally produce the parallel counterparts of the sequential programs. We evaluate AUTOPARLLM on 11 applications of 2 well-known benchmark suites: NAS Parallel Benchmark and Rodinia Benchmark. Our results show that AUTOPARLLM is indeed effective in improving the state-of-the-art LLM-based models for the task of parallel code generation in terms of multiple code generation metrics. AUTOPARLLM also improves the average runtime of the parallel code generated by the state-of-the-art LLMs by as high as 3.4% and 2.9% for the NAS Parallel Benchmark and Rodinia Benchmark respectively.
Additionally, to overcome the issue that well-known metrics for translation evaluation have not been optimized to evaluate the quality of the generated parallel code, we propose OMPScore for evaluating the quality of the generated code. We show that OMPScore exhibits a better correlation with human judgment than existing metrics, measured by up to 75% improvement of Spearman correlation.

Submitted 8 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: 10 pages

arXiv:2310.00247 [pdf, other]

Bridging the Gap Between Foundation Models and Heterogeneous Federated Learning

Authors: Sixing Yu, J. Pablo Muñoz, Ali Jannesari

Abstract: Federated learning (FL) offers privacy-preserving decentralized machine learning, optimizing models at edge clients without sharing private data. Simultaneously, foundation models (FMs) have gained traction in the artificial intelligence (AI) community due to their exceptional performance across various tasks. However, integrating FMs into FL presents challenges, primarily due to their substantial size and intensive resource requirements. This is especially true when considering the resource heterogeneity in edge FL systems. We present an adaptive framework for Resource-aware Federated Foundation Models (RaFFM) to address these challenges. RaFFM introduces specialized model compression algorithms tailored for FL scenarios, such as salient parameter prioritization and high-performance subnetwork extraction. These algorithms enable dynamic scaling of given transformer-based FMs to fit heterogeneous resource constraints at the network edge during both FL's optimization and deployment stages. Experimental results demonstrate that RaFFM shows significant superiority in resource utilization efficiency and uses fewer resources to deploy FMs to FL. Despite the lower resource consumption, target models optimized by RaFFM achieve performance on par with traditional FL methods applied to full-sized FMs. This is evident across tasks in both natural language processing and computer vision domains.

Submitted 4 October, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

arXiv:2308.04693 [pdf, other]

doi 10.1145/3583780.3614869

Evaluating and Optimizing the Effectiveness of Neural Machine Translation in Supporting Code Retrieval Models: A Study on the CAT Benchmark

Authors: Hung Phan, Ali Jannesari

Abstract: Neural Machine Translation (NMT) is widely applied in software engineering tasks. The effectiveness of NMT for code retrieval relies on the ability to learn from the sequence of tokens in the source language to the sequence of tokens in the target language. While NMT performs well in pseudocode-to-code translation, it might have challenges in learning to translate from natural language query to source code in newly curated real-world code documentation/implementation datasets. In this work, we analyze the performance of NMT in natural language-to-code translation in the newly curated CAT benchmark, which includes optimized versions of three Java datasets (TLCodeSum, CodeSearchNet, and Funcom) and a Python dataset (PCSD). Our evaluation shows that NMT has low accuracy on this task, as measured by the CrystalBLEU and Meteor metrics. To ease the burden on NMT of learning a complex representation of source code, we propose ASTTrans Representation, a tailored representation of an Abstract Syntax Tree (AST) using a subset of non-terminal nodes. We show that the classical NMT approach performs significantly better in learning ASTTrans Representation than code tokens, with up to 36% improvement in Meteor score. Moreover, we leverage ASTTrans Representation to conduct combined code search processes built on the state-of-the-art code search processes using GraphCodeBERT and UniXcoder. Our NMT models that learn ASTTrans Representation can boost the Mean Reciprocal Rank of these state-of-the-art code search processes by up to 3.08% and improve the results of 23.08% of queries over the CAT benchmark.
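Note on the metric: Mean Reciprocal Rank (MRR), cited in the abstract above, is conventionally the average over queries of 1/rank of the first relevant result. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the toy queries are assumptions made for illustration.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Hashable) -> float:
    """Return 1/position of `relevant` in `ranked` (1-based), or 0.0 if absent."""
    for position, item in enumerate(ranked, start=1):
        if item == relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    results: Sequence[Sequence[Hashable]],
    relevant_items: Sequence[Hashable],
) -> float:
    """Average the reciprocal rank over all queries.

    results[i] is the ranked candidate list for query i and
    relevant_items[i] is the single correct answer for that query.
    """
    if len(results) != len(relevant_items):
        raise ValueError("results and relevant_items must have the same length")
    if not results:
        return 0.0
    total = sum(reciprocal_rank(r, t) for r, t in zip(results, relevant_items))
    return total / len(results)


if __name__ == "__main__":
    # Toy data (hypothetical): hit at rank 1, hit at rank 2, and a miss.
    toy_results = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
    toy_relevant = ["a", "y", "missing"]
    print(mean_reciprocal_rank(toy_results, toy_relevant))
```

Under this definition, a query whose correct snippet is not retrieved at all contributes zero, so small MRR gains can reflect either better ranking of existing hits or newly recovered hits.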

Submitted 9 August, 2023; originally announced August 2023.

Comments: Accepted as Full Paper in Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM), Birmingham, UK, October 2023

arXiv:2306.00210 [pdf, other]

PERFOGRAPH: A Numerical Aware Program Graph Representation for Performance Optimization and Program Analysis

Authors: Ali TehraniJamsaz, Quazi Ishtiaque Mahmud, Le Chen, Nesreen K. Ahmed, Ali Jannesari

Abstract: The remarkable growth and significant success of machine learning have expanded its applications into programming languages and program analysis. However, a key challenge in adopting the latest machine learning methods is the representation of programming languages, which directly impacts the ability of machine learning methods to reason about programs. The absence of numerical awareness and aggregate data structure information, together with an improper way of presenting variables, has limited the performance of previous representation works. To overcome the limitations and challenges of current program representations, we propose a graph-based program representation called PERFOGRAPH. PERFOGRAPH can capture numerical information and the aggregate data structure by introducing new nodes and edges. Furthermore, we propose an adapted embedding method to incorporate numerical awareness. These enhancements make PERFOGRAPH a highly flexible and scalable representation that effectively captures programs' intricate dependencies and semantics. Consequently, it serves as a powerful tool for various applications such as program analysis, performance optimization, and parallelism discovery. Our experimental results demonstrate that PERFOGRAPH outperforms existing representations and sets new state-of-the-art results by reducing the error rate by 7.4% (AMD dataset) and 10% (NVIDIA dataset) in the well-known Device Mapping challenge. It also sets new state-of-the-art results in various performance optimization tasks like Parallelism Discovery and NUMA and Prefetchers Configuration prediction.

Submitted 29 November, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

arXiv:2305.11414 [pdf, other]

Federated Foundation Models: Privacy-Preserving and Collaborative Learning for Large Models

Authors: Sixing Yu, J. Pablo Muñoz, Ali Jannesari

Abstract: Foundation Models (FMs), such as LLaMA, BERT, GPT, ViT, and CLIP, have demonstrated remarkable success in a wide range of applications, driven by their ability to leverage vast amounts of data for pre-training. However, optimizing FMs often requires access to sensitive data, raising privacy concerns and limiting their applicability in many domains. In this paper, we propose the Federated Foundation Models (FFMs) paradigm, which combines the benefits of FMs and Federated Learning (FL) to enable privacy-preserving and collaborative learning across multiple end-users. We discuss the potential benefits and challenges of integrating FL into the lifespan of FMs, covering pre-training, fine-tuning, and application. We further outline potential future research avenues in FFM, including FFM pre-training, FFM fine-tuning, and federated prompt tuning, which allow the development of more personalized and context-aware models while ensuring data privacy. Moreover, we explore the possibility of continual/lifelong learning in FFMs, as increased computational power at the edge may unlock the potential for optimizing FMs using newly generated private data close to the data source. The proposed FFM concepts offer a flexible and scalable framework for training large language models in a privacy-preserving manner, setting the stage for subsequent advancements in both FM training and federated learning.

Submitted 19 March, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted at the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

arXiv:2305.05779 [pdf, other]

Learning to Parallelize with OpenMP by Augmented Heterogeneous AST Representation

Authors: Le Chen, Quazi Ishtiaque Mahmud, Hung Phan, Nesreen K. Ahmed, Ali Jannesari

Abstract: Detecting parallelizable code regions is a challenging task, even for experienced developers. Numerous recent studies have explored the use of machine learning for code analysis and program synthesis, including parallelization, in light of the success of machine learning in natural language processing. However, applying machine learning techniques to parallelism detection presents several challenges, such as the lack of an adequate dataset for training, an effective code representation with rich information, and a suitable machine learning model to learn the latent features of code for diverse analyses. To address these challenges, we propose a novel graph-based learning approach called Graph2Par that utilizes a heterogeneous augmented abstract syntax tree (Augmented-AST) representation for code. The proposed approach primarily focuses on loop-level parallelization with OpenMP. Moreover, we create an OMP_Serial dataset with 18598 parallelizable and 13972 non-parallelizable loops to train the machine learning models. Our results show that our proposed approach detects parallelizable code regions with 85% accuracy and outperforms the state-of-the-art token-based machine learning approach. These results indicate that our approach is competitive with state-of-the-art tools and capable of handling loops with complex structures that other tools may overlook.

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2305.00875 [pdf, other]

Redundancy and Concept Analysis for Code-trained Language Models

Authors: Arushi Sharma, Zefu Hu, Christopher Quinn, Ali Jannesari

Abstract: Code-trained language models have proven to be highly effective for various code intelligence tasks. However, they can be challenging to train and deploy for many software engineering applications due to computational bottlenecks and memory constraints. Implementing effective strategies to address these issues requires a better understanding of these 'black box' models. In this paper, we perform the first neuron-level analysis for source code models to identify important neurons within latent representations. We achieve this by eliminating neurons that are highly similar or irrelevant to the given task. This approach helps us understand which neurons and layers can be eliminated (redundancy analysis) and where important code properties are located within the network (concept analysis). Using redundancy analysis, we make observations relevant to knowledge transfer and model optimization applications. We find that over 95% of the neurons are redundant with respect to our code intelligence tasks and can be eliminated without significant loss in accuracy. We also discover several subsets of neurons that can make predictions with baseline accuracy. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts within latent code representations, which could be used to influence model predictions. We trace individual important neurons and subsets of them to specific code properties and identify 'number' neurons, 'string' neurons, and higher-level 'text' neurons for token-level tasks, as well as higher-level concepts important for sentence-level downstream tasks. This also helps us understand how decomposable and transferable task-related features are and can help devise better techniques for transfer learning, model compression, and the decomposition of deep neural networks into modules.

Submitted 15 February, 2024; v1 submitted 1 May, 2023; originally announced May 2023.

Comments: 4 figures, 6 tables

arXiv:2304.12568 [pdf, other]

Performance Optimization using Multimodal Modeling and Heterogeneous GNN

Authors: Akash Dutta, Jordi Alcaraz, Ali TehraniJamsaz, Eduardo Cesar, Anna Sikora, Ali Jannesari

Abstract: Growing heterogeneity and configurability in HPC architectures have made auto-tuning applications and runtime parameters on these systems very complex. Users are presented with a multitude of options to configure parameters. In addition to application-specific solutions, a common approach is to use general-purpose search strategies, which often fail to identify the best configurations or take prohibitively long to converge. There is, thus, a need for a general-purpose and efficient tuning approach that can be easily scaled and adapted to various tuning tasks. We propose a technique for tuning parallel code regions that is general enough to be adapted to multiple tasks. In this paper, we analyze IR-based programming models to make task-specific performance optimizations. To this end, we propose the Multimodal Graph Neural Network and Autoencoder (MGA) tuner, a multimodal deep learning based approach that adapts Heterogeneous Graph Neural Networks and Denoising Autoencoders for modeling IR-based code representations that serve as separate modalities. This approach is used as part of our pipeline to model a syntax, semantics, and structure-aware IR-based code representation for tuning parallel code regions/kernels. We extensively experiment on OpenMP and OpenCL code regions/kernels obtained from PolyBench, Rodinia, STREAM, DataRaceBench, AMD SDK, NPB, NVIDIA SDK, Parboil, SHOC, and LULESH benchmarks. We apply our multimodal learning techniques to the tasks of (i) optimizing the number of threads, scheduling policy, and chunk size in OpenMP loops and (ii) identifying the best device for heterogeneous device mapping of OpenCL kernels. Our experiments show that this multimodal learning based approach outperforms the state-of-the-art in all experiments.

Submitted 27 April, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: 14 pages, 9 figures, 3 tables

arXiv:2304.04658 [pdf, other]

GraphBinMatch: Graph-based Similarity Learning for Cross-Language Binary and Source Code Matching

Authors: Ali TehraniJamsaz, Hanze Chen, Ali Jannesari

Abstract: Matching binary to source code and vice versa has various applications in different fields, such as computer security, software engineering, and reverse engineering. Even though there exist methods that try to match source code with binary code to accelerate the reverse engineering process, most of them are designed to focus on one programming language. However, in real life, programs are developed using different programming languages depending on their requirements. Thus, cross-language binary-to-source code matching has recently gained more attention. Nonetheless, the existing approaches still struggle to make precise predictions due to the inherent difficulty of matching binary code and source code across programming languages. In this paper, we address the problem of cross-language binary source code matching. We propose GraphBinMatch, an approach based on a graph neural network that learns the similarity between binary and source codes. We evaluate GraphBinMatch on several tasks, such as cross-language binary-to-source code matching and cross-language source-to-source matching. We also evaluate our approach's performance on single-language binary-to-source code matching. Experimental results show that GraphBinMatch significantly outperforms the state of the art, with improvements as high as 15% in F1 score.

Submitted 10 April, 2023; originally announced April 2023.

arXiv:2304.03487 [pdf, other]

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

Authors: Ali TehraniJamsaz, Alok Mishra, Akash Dutta, Abid M. Malik, Barbara Chapman, Ali Jannesari

Abstract: GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. In order to achieve portability among a variety of multi/many core architectures, a popular choice for an application developer is to utilize directive-based parallel programming models, such as OpenMP. However, even with OpenMP, the developer must choose from among many strategies for exploiting a GPU or a CPU. Recently, Machine Learning (ML) approaches have brought significant advances in the optimizations of HPC applications. To this end, several ways have been proposed to represent application characteristics for ML models. However, the available techniques fail to capture features that are crucial for exposing parallelism. In this paper, we introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree to represent control and data flow information. The originality of this work lies in the addition of new edges exploiting the implicit ordering and parent-child relationships in ASTs, as well as the introduction of edge weights to account for loop and condition information. We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region across CPUs and GPUs. Various transformations utilizing collapse and data transfer between the CPU and GPU are used to construct the dataset. The predicted runtime of the model is used to determine which transformation provides the best performance. Results show that our approach is indeed effective, with a normalized RMSE as low as 0.004 and at most 0.01 in its runtime predictions.

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2302.11467 [pdf, other]

Power Constrained Autotuning using Graph Neural Networks

Authors: Akash Dutta, Jee Choi, Ali Jannesari

Abstract: Recent advances in multi and many-core processors have led to significant improvements in the performance of scientific computing applications. However, the addition of a large number of complex cores has also increased the overall power consumption, and power has become a first-order design constraint in modern processors. While we can limit power consumption by simply applying software-based power constraints, applying them blindly will lead to non-trivial performance degradation. To address the challenge of improving the performance, power, and energy efficiency of scientific applications on modern multi-core processors, we propose a novel Graph Neural Network based auto-tuning approach that (i) optimizes runtime performance at pre-defined power constraints, and (ii) simultaneously optimizes for runtime performance and energy efficiency by minimizing the energy-delay product. The key idea behind this approach lies in modeling parallel code regions as flow-aware code graphs to capture both semantic and structural code features. We demonstrate the efficacy of our approach by conducting an extensive evaluation on 30 benchmarks and proxy-/mini-applications with 68 OpenMP code regions. Our approach identifies OpenMP configurations at different power constraints that yield a geometric mean performance improvement of more than 25% and 13% over the default OpenMP configuration on a 32-core Skylake and a 16-core Haswell processor, respectively. In addition, when we optimize for the energy-delay product, the OpenMP configurations selected by our auto-tuner demonstrate both performance improvement of 21% and 11% and energy reduction of 29% and 18% over the default OpenMP configuration at Thermal Design Power for the same Skylake and Haswell processors, respectively.
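Note on the objective: the energy-delay product (EDP) named in the abstract above is conventionally the energy consumed multiplied by the execution time, so minimizing it rewards configurations that are both fast and frugal. The sketch below illustrates only that selection criterion; it is not the authors' GNN-based tuner, and the configuration names and measurements are invented for illustration.

```python
from typing import Dict, Tuple


def energy_delay_product(energy_joules: float, runtime_seconds: float) -> float:
    """EDP = energy * delay; lower is better."""
    return energy_joules * runtime_seconds


def select_min_edp(measurements: Dict[str, Tuple[float, float]]) -> str:
    """Return the configuration name with the smallest EDP.

    `measurements` maps a configuration name to (energy in joules,
    runtime in seconds). All names and values are hypothetical.
    """
    if not measurements:
        raise ValueError("measurements must not be empty")
    return min(
        measurements,
        key=lambda name: energy_delay_product(*measurements[name]),
    )


if __name__ == "__main__":
    # Invented measurements, not results from the paper.
    candidates = {
        "default": (100.0, 10.0),   # EDP = 1000.0
        "tuned_a": (71.0, 7.9),     # lower energy and lower runtime
        "tuned_b": (60.0, 12.0),    # lowest energy, but slower
    }
    print(select_min_edp(candidates))
```

In this toy data the lowest-energy configuration is not selected, because its longer runtime gives it a larger product than the configuration that improves both quantities.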

Submitted 22 February, 2023; originally announced February 2023.

Comments: 11 pages, 7 figures, 2 tables, IPDPS '23

arXiv:2301.11787 [pdf, other]

Accelerating Domain-aware Deep Learning Models with Distributed Training

Authors: Aishwarya Sarkar, Chaoqun Lu, Ali Jannesari

Abstract: Recent advances in data-generating techniques led to an explosive growth of geo-spatiotemporal data. In domains such as hydrology, ecology, and transportation, interpreting the complex underlying patterns of spatiotemporal interactions with the help of deep learning techniques hence becomes the need of the hour. However, applying deep learning techniques without domain-specific knowledge tends to provide sub-optimal prediction performance. Secondly, training such models on large-scale data requires extensive computational resources. To eliminate these challenges, we present a novel distributed domain-aware spatiotemporal network that utilizes domain-specific knowledge with improved model performance. Our network consists of a pixel-contribution block, a distributed multiheaded multichannel convolutional (CNN) spatial block, and a recurrent temporal block. We choose flood prediction in hydrology as a use case to test our proposed method. From our analysis, the network effectively predicts high peaks in discharge measurements at watershed outlets with up to 4.1x speedup and increased prediction performance of up to 93%. Our approach achieved a 12.6x overall speedup and increased the mean prediction performance by 16%. We perform extensive experiments on a dataset of 23 watersheds in a northern state of the U.S. and present our findings.

Submitted 25 January, 2023; originally announced January 2023.

Comments: Accepted for Workshop on Multi-scale, Multi-physic and Coupled Problems on Highly Parallel Systems, HPC Asia 2023, 27 February - 2 March 2023, Singapore

arXiv:2212.08743 [pdf, other]

Addressing Data Heterogeneity in Decentralized Learning via Topological Pre-processing

Authors: Waqwoya Abebe, Ali Jannesari

Abstract: Recently, local peer topology has been shown to influence the overall convergence of decentralized learning (DL) graphs in the presence of data heterogeneity. In this paper, we demonstrate the advantages of constructing a proxy-based locally heterogeneous DL topology to enhance convergence and maintain data privacy. In particular, we propose a novel peer clumping strategy to efficiently cluster peers before arranging them in a final training graph. By showing how locally heterogeneous graphs outperform locally homogeneous graphs of similar size and from the same global data distribution, we present a strong case for topological pre-processing. Moreover, we demonstrate the scalability of our approach by showing how the proposed topological pre-processing overhead remains small in large graphs while the performance gains get even more pronounced. Furthermore, we show the robustness of our approach in the presence of network partitions.

Submitted 16 December, 2022; originally announced December 2022.

arXiv:2212.06352 [pdf, other]

Towards Seamless Management of AI Models in High-Performance Computing

Authors: Sixing Yu, Murali Emani, Chunhua Liao, Pei-Hung Lin, Tristan Vanderbruggen, Xipeng Shen, Ali Jannesari

Abstract: With the increasing prevalence of artificial intelligence (AI) in diverse science/engineering communities, AI models emerge on an unprecedented scale among various domains. However, given the complexity and diversity of the software and hardware environments, reusing AI artifacts (models and datasets) is extremely challenging, especially with AI-driven science applications. Building an ecosystem to run and reuse AI applications/datasets at scale efficiently becomes increasingly essential for diverse science and engineering and high-performance computing (HPC) communities. In this paper, we innovate over an HPC-AI ecosystem, HPCFair, which enables the Findable, Accessible, Interoperable, and Reproducible (FAIR) principles. HPCFair enables the collection of AI models/datasets, allowing users to download/upload AI artifacts with authentication. Most importantly, our proposed framework provides user-friendly APIs for users to easily run inference jobs and customize AI artifacts to their tasks as needed. Our results show that, with the HPCFair API, users, irrespective of technical expertise in AI, can easily leverage AI artifacts for their tasks with minimal effort.

Submitted 12 December, 2022; originally announced December 2022.

Comments: Accepted at the 2nd Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE)

arXiv:2211.05716 [pdf, other]

Resource-Aware Heterogeneous Federated Learning using Neural Architecture Search

Authors: Sixing Yu, J. Pablo Muñoz, Ali Jannesari

Abstract: Federated Learning (FL) is extensively used to train AI/ML models in distributed and privacy-preserving settings. Participant edge devices in FL systems typically contain non-independent and identically distributed (Non-IID) private data and unevenly distributed computational resources. Preserving user data privacy while optimizing AI/ML models in a heterogeneous federated network requires us to address data and system/resource heterogeneity. To address these challenges, we propose Resource-aware Federated Learning (RaFL). RaFL allocates resource-aware specialized models to edge devices using Neural Architecture Search (NAS) and allows heterogeneous model architecture deployment by knowledge extraction and fusion. Combining NAS and FL enables on-demand customized model deployment for resource-diverse edge devices. Furthermore, we propose a multi-model architecture fusion scheme allowing the aggregation of the distributed learning results. Results demonstrate RaFL's superior resource efficiency compared to SoTA.

Submitted 30 April, 2024; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: Accepted at the 30th International European Conference on Parallel and Distributed Computing (Euro-Par 2024)

arXiv:2208.07978 [pdf, other]

Enhancing Heterogeneous Federated Learning with Knowledge Extraction and Multi-Model Fusion

Authors: Duy Phuong Nguyen, Sixing Yu, J. Pablo Muñoz, Ali Jannesari

Abstract: Concerned with user data privacy, this paper presents a new federated learning (FL) method that trains machine learning models on edge devices without accessing sensitive data. Traditional FL methods, although privacy-protective, fail to manage model heterogeneity and incur high communication costs due to their reliance on aggregation methods. To address this limitation, we propose a resource-aware FL method that aggregates local knowledge from edge models and distills it into robust global knowledge through knowledge distillation. This method allows efficient multi-model knowledge fusion and the deployment of resource-aware models while preserving model heterogeneity. Our method improves communication cost and performance in heterogeneous data and models compared to existing FL algorithms. Notably, it reduces the communication cost of ResNet-32 by up to 50% and VGG-11 by up to 10× while delivering superior performance.

Submitted 30 September, 2023; v1 submitted 16 August, 2022; originally announced August 2022.

Comments: Accepted at the 4th workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S), SC 23

arXiv:2206.11023 [pdf, other]

doi 10.1145/3544902.3546248

Heterogeneous Graph Neural Networks for Software Effort Estimation

Authors: Hung Phan, Ali Jannesari

Abstract: Software effort can be measured by story point [35]. Current approaches for automatically estimating story points focus on applying pre-trained embedding models and deep learning for text regression, which requires expensive embedding models. We propose HeteroSP, a tool for estimating story points from textual input of Agile software project issues. We select GPT2SP [12] and Deep-SE [8] as the baselines for comparison. First, from the analysis of the story point dataset [8], we conclude that software issues are actually a mixture of natural language sentences with quoted code snippets and have problems related to large-size vocabulary. Second, we provide a module to normalize the input text including words and code tokens of the software issues. Third, we design an algorithm to convert an input software issue to a graph with different types of nodes and edges. Fourth, we construct a heterogeneous graph neural networks model with the support of fastText [6] for constructing initial node embedding to learn and predict the story points of new issues. We compare against our baseline approaches over three estimation scenarios: within project, cross-project within the repository, and cross-project cross-repository. We achieve an average Mean Absolute Error (MAE) of 2.38, 2.61, and 2.63 for the three scenarios. We outperform GPT2SP in 2/3 of the scenarios while outperforming Deep-SE in the most challenging scenario with significantly less running time. We also compare our approaches with different homogeneous graph neural network models and the results show that the heterogeneous graph neural networks model outperforms the homogeneous models in story point estimation. For time performance, we achieve about 570 seconds across all three processes: node embedding initialization, model construction, and story point estimation.

Submitted 30 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: Accepted in the Technical Papers Track of the 16th International Symposium on Empirical Software Engineering and Measurement, 2022 (ESEM 2022)

arXiv:2203.03062 [pdf, other]

Story Point Effort Estimation by Text Level Graph Neural Network

Authors: Hung Phan, Ali Jannesari

Abstract: Estimating the effort of software projects developed by agile methods is important for project managers or technical leads. It provides a summary as a first view of how many hours and developers are required to complete the tasks. There are research works on automatically predicting software effort, including Term Frequency Inverse Document Frequency (TFIDF) as the traditional approach for this problem. Graph Neural Network is a new approach that has been applied in Natural Language Processing for text classification. The advantages of Graph Neural Network are based on the ability to learn information via graph data structure, which has more representations such as the relationships between words compared to approaches of vectorizing sequence of words. In this paper, we show the potential and possible challenges of Graph Neural Network text classification in story point level estimation. Through experiments, we show that the GNN Text Level Classification can achieve accuracy as high as about 80 percent for story points level classification, which is comparable to the traditional approach. We also analyze the GNN approach and point out several current disadvantages that the GNN approach can improve for this problem or other problems in software engineering.

Submitted 14 March, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

Comments: accepted at The 1st International Workshop on Natural Language-based Software Engineering (to appear)

arXiv:2203.00611 [pdf, other]

Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization

Authors: Ali TehraniJamsaz, Mihail Popov, Akash Dutta, Emmanuelle Saillard, Ali Jannesari

Abstract: There is a large space of NUMA and hardware prefetcher configurations that can significantly impact the performance of an application. Previous studies have demonstrated how a model can automatically select configurations based on the dynamic properties of the code to achieve speedups. This paper demonstrates how the static Intermediate Representation (IR) of the code can guide NUMA/prefetcher optimizations without the prohibitive cost of performance profiling. We propose a method to create a comprehensive dataset that includes a diverse set of intermediate representations along with optimum configurations. We then apply a graph neural network model in order to validate this dataset. We show that our static intermediate representation based model achieves 80% of the performance gains provided by expensive dynamic performance profiling based strategies. We further develop a hybrid model that uses both static and dynamic information. Our hybrid model achieves the same gains as the dynamic models but at a reduced cost by only profiling 30% of the programs.

Submitted 1 March, 2022; originally announced March 2022.

arXiv:2112.00847 [pdf, other]

CLAWS: Contrastive Learning with hard Attention and Weak Supervision

Authors: Jansel Herrera-Gerena, Ramakrishnan Sundareswaran, John Just, Matthew Darr, Ali Jannesari

Abstract: Learning effective visual representations without human supervision is a long-standing problem in computer vision. Recent advances in self-supervised learning algorithms have utilized contrastive learning, with methods such as SimCLR, which applies a composition of augmentations to an image, and minimizes a contrastive loss between the two augmented images. In this paper, we present CLAWS, an annotation-efficient learning framework, addressing the problem of manually labeling large-scale agricultural datasets along with potential applications such as anomaly detection and plant growth analytics. CLAWS uses a network backbone inspired by SimCLR and weak supervision to investigate the effect of contrastive learning within class clusters. In addition, we inject a hard attention mask to the cropped input image before maximizing agreement between the image pairs using a contrastive loss function. This mask forces the network to focus on pertinent object features and ignore background features. We compare results between a supervised SimCLR and CLAWS using an agricultural dataset with 227,060 samples consisting of 11 different crop classes. Our experiments and extensive evaluations show that CLAWS achieves a competitive NMI score of 0.7325. Furthermore, CLAWS engenders the creation of low dimensional representations of very large datasets with minimal parameter tuning and forming well-defined clusters, which lend themselves to using efficient, transparent, and highly interpretable clustering methods such as Gaussian Mixture Models.

Submitted 31 January, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

arXiv:2111.14345 [pdf, other]

SPATL: Salient Parameter Aggregation and Transfer Learning for Heterogeneous Clients in Federated Learning

Authors: Sixing Yu, Phuong Nguyen, Waqwoya Abebe, Wei Qian, Ali Anwar, Ali Jannesari

Abstract: Federated learning (FL) facilitates the training and deploying AI models on edge devices. Preserving user data privacy in FL introduces several challenges, including expensive communication costs, limited resources, and data heterogeneity. In this paper, we propose SPATL, an FL method that addresses these issues by: (a) introducing a salient parameter selection agent and communicating selected parameters only; (b) splitting a model into a shared encoder and a local predictor, and transferring its knowledge to heterogeneous clients via the locally customized predictor. Additionally, we leverage a gradient control mechanism to further speed up model convergence and increase robustness of training processes. Experiments demonstrate that SPATL reduces communication overhead, accelerates model inference, and enables stable training processes with better results compared to state-of-the-art methods. Our approach reduces communication cost by up to $86.45\%$, accelerates local inference by reducing up to $39.7\%$ FLOPs on VGG-11, and requires $7.4 \times$ less communication overhead when training ResNet-20.

Submitted 26 August, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: Accepted at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC 22)

arXiv:2110.00841 [pdf, other]

Transfer Learning Approaches for Knowledge Discovery in Grid-based Geo-Spatiotemporal Data

Authors: Aishwarya Sarkar, Jien Zhang, Chaoqun Lu, Ali Jannesari

Abstract: Extracting and meticulously analyzing geo-spatiotemporal features is crucial to recognize intricate underlying causes of natural events, such as floods. Limited evidence about hidden factors leading to climate change makes it challenging to predict regional water discharge accurately. In addition, the explosive growth in complex geo-spatiotemporal environment data that requires repeated learning by the state-of-the-art neural networks for every new region emphasizes the need for new computationally efficient methods, advanced computational resources, and extensive training on a massive amount of available monitored data. We, therefore, propose HydroDeep, an effectively reusable pretrained model to address this problem of transferring knowledge from one region to another by effectively capturing their intrinsic geo-spatiotemporal variance. Further, we present four transfer learning approaches on HydroDeep for spatiotemporal interpretability that improve Nash-Sutcliffe efficiency by 9% to 108% in new regions with a 95% reduction in time.

Submitted 1 November, 2021; v1 submitted 2 October, 2021; originally announced October 2021.

arXiv:2109.12714 [pdf, other]

Cluster Analysis with Deep Embeddings and Contrastive Learning

Authors: Ramakrishnan Sundareswaran, Jansel Herrera-Gerena, John Just, Ali Jannesari

Abstract: Unsupervised disentangled representation learning is a long-standing problem in computer vision. This work proposes a novel framework for performing image clustering from deep embeddings by combining instance-level contrastive learning with a deep embedding based cluster center predictor. Our approach jointly learns representations and predicts cluster centers in an end-to-end manner. This is accomplished via a three-pronged approach that combines a clustering loss, an instance-wise contrastive loss, and an anchor loss. Our fundamental intuition is that using an ensemble loss that incorporates instance-level features and a clustering procedure focusing on semantic similarity reinforces learning better representations in the latent space. We observe that our method performs exceptionally well on popular vision datasets when evaluated using standard clustering metrics such as Normalized Mutual Information (NMI), in addition to producing geometrically well-separated cluster embeddings as defined by the Euclidean distance. Our framework performs on par with widely accepted clustering methods and outperforms the state-of-the-art contrastive learning method on the CIFAR-10 dataset with an NMI score of 0.772, a 7-8% improvement on the strong baseline.

Submitted 2 October, 2021; v1 submitted 26 September, 2021; originally announced September 2021.

arXiv:2109.02145 [pdf, other]

Temporal Shift Reinforcement Learning

Authors: Deepak George Thomas, Tichakorn Wongpiromsarn, Ali Jannesari

Abstract: The function approximators employed by traditional image-based Deep Reinforcement Learning (DRL) algorithms usually lack a temporal learning component and instead focus on learning the spatial component. We propose a technique, Temporal Shift Reinforcement Learning (TSRL), wherein both temporal, as well as spatial components are jointly learned. Moreover, TSRL does not require additional parameters to perform temporal learning. We show that TSRL outperforms the commonly used frame stacking heuristic on both of the Atari environments we test on while beating the SOTA for one of them. This investigation has implications in the robotics as well as sequential decision-making domains.

Submitted 26 October, 2021; v1 submitted 5 September, 2021; originally announced September 2021.

arXiv:2106.06921 [pdf, other]

Heterogeneous Federated Learning using Dynamic Model Pruning and Adaptive Gradient

Authors: Sixing Yu, Phuong Nguyen, Ali Anwar, Ali Jannesari

Abstract: Federated Learning (FL) has emerged as a new paradigm for training machine learning models distributively without sacrificing data security and privacy. Learning models on edge devices such as mobile phones is one of the most common use cases for FL. However, Non-identical independent distributed (non-IID) data in edge devices easily leads to training failures. Especially, over-parameterized machine learning models can easily be over-fitted on such data, hence, resulting in inefficient federated learning and poor model performance. To overcome the over-fitting issue, we proposed an adaptive dynamic pruning approach for FL, which can dynamically slim the model by dropping out unimportant parameters, hence, preventing over-fittings. Since the machine learning model's parameters react differently for different training samples, adaptive dynamic pruning will evaluate the salience of the model's parameter according to the input training sample, and only retain the salient parameter's gradients when doing back-propagation. We performed comprehensive experiments to evaluate our approach. The results show that our approach by removing the redundant parameters in neural networks can significantly reduce the over-fitting issue and greatly improves the training efficiency. In particular, when training the ResNet-32 on CIFAR-10, our approach reduces the communication cost by 57%. We further demonstrate the inference acceleration capability of the proposed algorithm. Our approach reduces up to 50% FLOPs inference of DNNs on edge devices while maintaining the model's quality.

Submitted 9 February, 2023; v1 submitted 13 June, 2021; originally announced June 2021.

Comments: Preprint of the CCGrid 2023 Submission

arXiv:2105.12254 [pdf, other]

Interpretable UAV Collision Avoidance using Deep Reinforcement Learning

Authors: Deepak-George Thomas, Daniil Olshanskyi, Karter Krueger, Tichakorn Wongpiromsarn, Ali Jannesari

Abstract: The significant components of any successful autonomous flight system are task completion and collision avoidance. Most deep learning algorithms successfully execute these aspects under the environment and conditions they are trained. However, they fail when subjected to novel environments. This paper presents an autonomous multi-rotor flight algorithm, using Deep Reinforcement Learning augmented with Self-Attention Models, that can effectively reason when subjected to varying inputs. In addition to their reasoning ability, they are also interpretable, enabling it to be used under real-world conditions. We have tested our algorithm under different weather conditions and environments and found it robust compared to conventional Deep Reinforcement Learning algorithms.

Submitted 4 June, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

arXiv:2103.06403 [pdf, other]

A Vision Based Deep Reinforcement Learning Algorithm for UAV Obstacle Avoidance

Authors: Jeremy Roghair, Kyungtae Ko, Amir Ehsan Niaraki Asli, Ali Jannesari

Abstract: Integration of reinforcement learning with unmanned aerial vehicles (UAVs) to achieve autonomous flight has been an active research area in recent years. An important part focuses on obstacle detection and avoidance for UAVs navigating through an environment. Exploration in an unseen environment can be tackled with Deep Q-Network (DQN). However, value exploration with uniform sampling of actions may lead to redundant states, where often the environments inherently bear sparse rewards. To resolve this, we present two techniques for improving exploration for UAV obstacle avoidance. The first is a convergence-based approach that uses convergence error to iterate through unexplored actions and temporal threshold to balance exploration and exploitation. The second is a guidance-based approach using a Domain Network which uses a Gaussian mixture distribution to compare previously seen states to a predicted next state in order to select the next action. Performance and evaluation of these approaches were implemented in multiple 3-D simulation environments, with variation in complexity. The proposed approach demonstrates a two-fold improvement in average rewards compared to state of the art.

Submitted 10 March, 2021; originally announced March 2021.

Comments: 12 pages, 6 figures

arXiv:2102.03214 [pdf, other]

Topology-Aware Network Pruning using Multi-stage Graph Embedding and Reinforcement Learning

Authors: Sixing Yu, Arya Mazaheri, Ali Jannesari

Abstract: Model compression is an essential technique for deploying deep neural networks (DNNs) on power and memory-constrained resources. However, existing model-compression methods often rely on human expertise and focus on parameters' local importance, ignoring the rich topology information within DNNs. In this paper, we propose a novel multi-stage graph embedding technique based on graph neural networks (GNNs) to identify DNN topologies and use reinforcement learning (RL) to find a suitable compression policy. We performed resource-constrained (i.e., FLOPs) channel pruning and compared our approach with state-of-the-art model compression methods. We evaluated our method on various models from typical to mobile-friendly networks, such as ResNet family, VGG-16, MobileNet-v1/v2, and ShuffleNet. Results show that our method can achieve higher compression ratios with a minimal fine-tuning cost yet yields outstanding and competitive performance.

Submitted 1 July, 2022; v1 submitted 5 February, 2021; originally announced February 2021.

Comments: Accepted at ICML 2022 Long presentation

arXiv:2011.12641 [pdf, other]

Auto Graph Encoder-Decoder for Neural Network Pruning

Authors: Sixing Yu, Arya Mazaheri, Ali Jannesari

Abstract: Model compression aims to deploy deep neural networks (DNN) on mobile devices with limited computing and storage resources. However, most of the existing model compression methods rely on manually defined rules, which require domain expertise. DNNs are essentially computational graphs, which contain rich structural information. In this paper, we aim to find a suitable compression policy from DNNs' structural information. We propose an automatic graph encoder-decoder model compression (AGMC) method combined with graph neural networks (GNN) and reinforcement learning (RL). We model the target DNN as a graph and use GNN to learn the DNN's embeddings automatically. We compared our method with rule-based DNN embedding model compression methods to show the effectiveness of our method. Results show that our learning-based DNN embedding achieves better performance and a higher compression ratio with fewer search steps. We evaluated our method on over-parameterized and mobile-friendly DNNs and compared our method with handcrafted and learning-based model compression approaches. On over parameterized DNNs, such as ResNet-56, our method outperformed handcrafted and learning-based methods with $4.36\%$ and $2.56\%$ higher accuracy, respectively. Furthermore, on MobileNet-v2, we achieved a higher compression ratio than state-of-the-art methods with just $0.93\%$ accuracy loss.

Submitted 9 November, 2021; v1 submitted 25 November, 2020; originally announced November 2020.

Comments: In proc. of ICCV 2021

arXiv:2010.04328 [pdf, other]

HydroDeep -- A Knowledge Guided Deep Neural Network for Geo-Spatiotemporal Data Analysis

Authors: Aishwarya Sarkar, Jien Zhang, Chaoqun Lu, Ali Jannesari

Abstract: Due to limited evidence and complex causes of regional climate change, the confidence in predicting fluvial floods remains low. Understanding the fundamental mechanisms intrinsic to geo-spatiotemporal information is crucial to improve the prediction accuracy. This paper demonstrates a hybrid neural network architecture - HydroDeep, that couples a process-based hydro-ecological model with a combination of Deep Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) Network. HydroDeep outperforms the independent CNN's and LSTM's performance by 1.6% and 10.5% respectively in Nash-Sutcliffe efficiency. Also, we show that HydroDeep pre-trained in one region is adept at passing on its knowledge to distant places via unique transfer learning approaches that minimize HydroDeep's training duration for a new region by learning its regional geo-spatiotemporal features in a reduced number of iterations.

Submitted 8 February, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

arXiv:2008.08951 [pdf, other]

Static Neural Compiler Optimization via Deep Reinforcement Learning

Authors: Rahim Mammadli, Ali Jannesari, Felix Wolf

Abstract: The phase-ordering problem of modern compilers has received a lot of attention from the research community over the years, yet remains largely unsolved. Various optimization sequences exposed to the user are manually designed by compiler developers. In designing such a sequence developers have to choose the set of optimization passes, their parameters and ordering within a sequence. Resulting sequences usually fall short of achieving optimal runtime for a given source code and may sometimes even degrade the performance when compared to unoptimized version. In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM's O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training and achieves competitive performance on the validation set, gaining up to 1.32x speedup on previously-unseen programs. Notably, our approach differs from autotuning methods by not depending on one or more test runs of the program for making successful optimization decisions. It has no dependence on any dynamic feature, but only on the statically-attainable intermediate representation of the source code. We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents, at first to complement, and eventually replace the hand-crafted optimization sequences.

Submitted 16 October, 2020; v1 submitted 20 August, 2020; originally announced August 2020.

Comments: 11 pages, 5 figures

arXiv:1909.12217 [pdf, other]

Visual Exploration and Energy-aware Path Planning via Reinforcement Learning

Authors: Amir Niaraki, Jeremy Roghair, Ali Jannesari

Abstract: Visual exploration and smart data collection via autonomous vehicles is an attractive topic in various disciplines. Disturbances like wind significantly influence both the power consumption of the flying robots and the performance of the camera. We propose a reinforcement learning approach which combines the effects of the power consumption and the object detection modules to develop a policy for object detection in large areas with limited battery life. The learning model enables dynamic learning of the negative rewards of each action based on the drag forces that result from the motion of the flying robot with respect to the wind field. The algorithm is implemented in a near-real world simulation environment both for the planar motion and flight in different altitudes. The trained agent often performed a trade-off between detecting the objects with high accuracy and increasing the area coverage within its battery life. The developed exploration policy outperformed the complete coverage algorithm by minimizing the traveled path while finding the target objects. The performance of the algorithms under various wind fields was evaluated in planar and 3D motion. During an exploration task with sparsely distributed goals and within a UAV's battery life, the proposed architecture could detect more than twice the amount of goal objects compared to the coverage path planning algorithm in moderate wind field. In high wind intensities, the energy-aware algorithm could detect 4 times the amount of goal objects when compared to its complete coverage counterpart.

Submitted 25 January, 2021; v1 submitted 26 September, 2019; originally announced September 2019.

Comments: 20 Pages, 14 figures

arXiv:1907.07110 [pdf]

DeepRace: Finding Data Race Bugs via Deep Learning

Authors: Ali Tehrani, Mohammed Khaleel, Reza Akbari, Ali Jannesari

Abstract: With the proliferation of multi-core hardware, parallel programs have become ubiquitous. These programs have their own type of bugs known as concurrency bugs and among them, data race bugs have been mostly in the focus of researchers over the past decades. In fact, detecting data races is a very challenging and important task. There have been several research paths in this area with many sophisticated tools designed and utilized that focus on detecting data race at the file level. In this paper, we propose DeepRace, a novel approach toward detecting data races in the source code. We build a deep neural network model to find data races instead of creating a data race detector manually. Our model uses a one-layer convolutional neural network (CNN) with different window size to find data races method. Then we adopt the class activation map function with global average pooling to extract the weights of the last convolutional layer and backpropagate it with the input source code to extract the line of codes with a data race. Thus, the DeepRace model can detect the data race bugs on a file and line of code level. In addition, we noticed that DeepRace successfully detects several buggy lines of code at different locations of the file. We tested the model with OpenMP and POSIX source code datasets which consist of more than 5000 and 8000 source code files respectively. We were able to successfully classify buggy source code files and achieve accuracies ranging from 81% to 86%. We also measured the performance of detecting and visualizing the data race at the line of code levels and our model achieved promising results. We only had a small number of false positives and false negatives, ranging from 1 to 10. Furthermore, we used the intersection over union to measure the accuracy of the buggy lines of code, and our model achieved promising results of 66 percent.

Submitted 15 July, 2019; originally announced July 2019.

Comments: 9 pages

arXiv:1907.06205 [pdf, other]

Automatic Repair and Type Binding of Undeclared Variables using Neural Networks

Authors: Venkatesh Theru Mohan, Ali Jannesari

Abstract: Deep learning had been used in program analysis for the prediction of hidden software defects using software defect datasets, security vulnerabilities using generative adversarial networks as well as identifying syntax errors by learning a trained neural machine translation on program codes. However, all these approaches either require defect datasets or bug-free source codes that are executable for training the deep learning model. Our neural network model is neither trained with any defect datasets nor bug-free programming source codes, instead it is trained using structural semantic details of Abstract Syntax Tree (AST) where each node represents a construct appearing in the source code. This model is implemented to fix one of the most common semantic errors, such as undeclared variable errors as well as infer their type information before program compilation. By this approach, the model has succeeded in correctly locating and identifying 81% of the programs on prutor dataset of 1059 programs with only undeclared variable errors and also inferring their types correctly in 80% of the programs.

Submitted 14 July, 2019; originally announced July 2019.

Comments: 16 pages, 16 figures

arXiv:1906.00786 [pdf, other]

Efficient Object Detection Model for Real-Time UAV Applications

Authors: Subrahmanyam Vaddi, Chandan Kumar, Ali Jannesari

Abstract: Unmanned Aerial Vehicles (UAVs) especially drones, equipped with vision techniques have become very popular in recent years, with their extensive use in wide range of applications. Many of these applications require use of computer vision techniques, particularly object detection from the information captured by on-board camera. In this paper, we propose an end to end object detection model running on a UAV platform which is suitable for real-time applications. We propose a deep feature pyramid architecture which makes use of inherent properties of features extracted from Convolutional Networks by capturing more generic features in the images (such as edge, color etc.) along with the minute detailed features specific to the classes contained in our problem. We use VisDrone-18 dataset for our studies which contain different objects such as pedestrians, vehicles, bicycles etc. We provide software and hardware architecture of our platform used in this study. We implemented our model with both ResNet and MobileNet as convolutional bases. Our model combined with modified focal loss function, produced a desirable performance of 30.6 mAP for object detection with an inference time of 14 fps. We compared our results with RetinaNet-ResNet-50 and HAL-RetinaNet and showed that our model combined with MobileNet as backend feature extractor gave the best results in terms of accuracy, speed and memory efficiency and is best suitable for real time object detection with drones.

Submitted 30 May, 2019; originally announced June 2019.

Comments: 10 pages, 4 figures, Under Review. arXiv admin note: substantial text overlap with arXiv:1808.07256 by other authors without attribution; substantial text overlap with arXiv:1807.06789, arXiv:1612.03144, arXiv:1809.03193 by other authors

arXiv:1611.06945 [pdf, other]

A Metaprogramming and Autotuning Framework for Deploying Deep Learning Applications

Authors: Matthew W. Moskewicz, Ali Jannesari, Kurt Keutzer

Abstract: In recent years, deep neural networks (DNNs) have yielded strong results on a wide range of applications. Graphics Processing Units (GPUs) have been one key enabling factor leading to the current popularity of DNNs. However, despite increasing hardware flexibility and software programming toolchain maturity, high-efficiency GPU programming remains difficult: it suffers from high complexity, low productivity, and low portability. GPU vendors such as NVIDIA have spent enormous effort to write special-purpose DNN libraries. However, on other hardware targets, especially mobile GPUs, such vendor libraries are not generally available. Thus, the development of portable, open, high-performance, energy-efficient GPU code for DNN operations would enable broader deployment of DNN-based algorithms. Toward this end, this work presents a framework to enable productive, high-efficiency GPU programming for DNN computations across hardware platforms and programming models. In particular, the framework provides specific support for metaprogramming, autotuning, and DNN-tailored data types. Using our framework, we explore implementing DNN operations on three different hardware targets: NVIDIA, AMD, and Qualcomm GPUs. On NVIDIA GPUs, we show both portability between OpenCL and CUDA as well as competitive performance compared to the vendor library. On Qualcomm GPUs, we show that our framework enables productive development of target-specific optimizations, and achieves reasonable absolute performance. Finally, on AMD GPUs, we show initial results that indicate our framework can yield reasonable performance on a new platform with minimal effort.

Submitted 21 November, 2016; originally announced November 2016.

Showing 1–49 of 49 results for author: Jannesari, A