Search | arXiv e-print repository

DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models

Authors: Eman Ali, Sathira Silva, Muhammad Haris Khan

Abstract: Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labelled data is unavailable. Recent research has proposed pseudo-labelling approaches to adapt CLIP in an unsupervised manner using unlabelled target data. Nonetheless, these methods struggle due to noisy pseudo-labels resulting from the misalignment between CLIP's visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes, acting as distinct classifiers, along with the convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve the adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and the state-of-the-art unsupervised adaptation baselines.

Submitted 16 August, 2024; originally announced August 2024.
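
The DPA abstract above mentions a convex combination of the outputs of two prototype-based classifiers. The following is only a minimal sketch of that general idea, not the authors' implementation: the function names, the softmax normalisation, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_pseudo_label(text_scores, image_scores, alpha=0.5):
    """Mix two classifiers' class probabilities and pick a pseudo-label.

    text_scores / image_scores: per-class similarities of one image to the
    textual and image prototypes (hypothetical inputs for this sketch).
    alpha: assumed mixing weight in [0, 1]; alpha=1 uses text prototypes only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    p_text = softmax(text_scores)
    p_image = softmax(image_scores)
    mixed = [alpha * t + (1.0 - alpha) * i for t, i in zip(p_text, p_image)]
    label = max(range(len(mixed)), key=mixed.__getitem__)
    return label, mixed

label, probs = combine_pseudo_label([2.0, 1.0, 0.0], [0.0, 3.0, 0.0], alpha=0.5)
print(label, [round(p, 3) for p in probs])
```

Because both inputs are probability vectors and the weights sum to one, the mixed vector is again a probability distribution, so its largest entry can be read as the pseudo-label's confidence.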

arXiv:2408.07445

Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Zaigham Zaheer, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf, Hassan Sajjad, Tom De Schepper, Markus Schedl

Abstract: Multimodal networks have demonstrated remarkable performance improvements over their unimodal counterparts. Existing multimodal networks are designed in a multi-branch fashion that, due to the reliance on fusion strategies, exhibit deteriorated performance if one or more modalities are missing. In this work, we propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities. It consists of a single-branch network sharing weights across multiple modalities to learn inter-modality representations to maximize performance as well as robustness to missing modalities. Extensive experiments are performed on four challenging datasets including textual-visual (UPMC Food-101, Hateful Memes, Ferramenta) and audio-visual modalities (VoxCeleb1). Our proposed method achieves superior performance when all modalities are present as well as in the case of missing modalities during training or testing compared to the existing state-of-the-art methods.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.06755

Sumotosima: A Framework and Dataset for Classifying and Summarizing Otoscopic Images

Authors: Eram Anwarul Khan, Anas Anwarul Haq Khan

Abstract: Otoscopy is a diagnostic procedure to examine the ear canal and eardrum using an otoscope. It identifies conditions like infections, foreign bodies, eardrum perforations and ear abnormalities. We propose a novel resource-efficient deep learning and transformer-based framework, Sumotosima (Summarizer for otoscopic images), an end-to-end pipeline for classification followed by summarization. Our framework works on a combination of triplet and cross-entropy losses. Additionally, we use Knowledge Enhanced Multimodal BART, whose input is a fused textual and image embedding. The objective is to provide summaries that are well-suited for patients, ensuring clarity and efficiency in understanding otoscopic images. Given the lack of existing datasets, we have curated our own OCASD (Otoscopic Classification And Summary Dataset), which includes 500 images with 5 unique categories annotated with their class and summaries by otolaryngologists. Sumotosima achieved a result of 98.03%, which is 7.00%, 3.10%, 3.01% higher than K-Nearest Neighbors, Random Forest and Support Vector Machines, respectively, in classification tasks. For summarization, Sumotosima outperformed GPT-4o and LLaVA by 88.53% and 107.57% in ROUGE scores, respectively. We have made our code and dataset publicly available at https://github.com/anas2908/Sumotosima

Submitted 13 August, 2024; originally announced August 2024.

Comments: Work in Progress
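
The Sumotosima abstract above refers to training on a combination of triplet and cross-entropy losses. As a rough sketch of what such a combined objective can look like (the weighting `weight`, the margin, the Euclidean distance, and all function names are assumptions, not details taken from the paper):

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss using Euclidean distance between embeddings."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def combined_loss(logits, target, anchor, positive, negative,
                  weight=1.0, margin=1.0):
    """Assumed combination: cross-entropy plus a weighted triplet term."""
    return cross_entropy(logits, target) + weight * triplet_loss(
        anchor, positive, negative, margin)

# The positive is already much closer than the negative, so the triplet
# term is zero and only the classification term contributes.
print(combined_loss([0.0, 0.0], 0, [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

The triplet term shapes the embedding space (same-class images close, different-class images far), while the cross-entropy term trains the classifier head; how the two are actually weighted in Sumotosima is not stated in the abstract.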

arXiv:2408.00498

How Effective are Self-Supervised Models for Contact Identification in Videos

Authors: Malitha Gunawardhana, Limalka Sadith, Liel David, Daniel Harari, Muhammad Haris Khan

Abstract: The exploration of video content via Self-Supervised Learning (SSL) models has unveiled a dynamic field of study, emphasizing both the complex challenges and unique opportunities inherent in this area. Despite the growing body of research, the ability of SSL models to detect physical contacts in videos remains largely unexplored, particularly the effectiveness of methods such as downstream supervision with linear probing or full fine-tuning. This work aims to bridge this gap by employing eight different convolutional neural networks (CNNs) based video SSL models to identify instances of physical contact within video sequences specifically. The Something-Something v2 (SSv2) and Epic-Kitchen (EK-100) datasets were chosen for evaluating these approaches due to the promising results on UCF101 and HMDB51, coupled with their limited prior assessment on SSv2 and EK-100. Additionally, these datasets feature diverse environments and scenarios, essential for testing the robustness and accuracy of video-based models. This approach not only examines the effectiveness of each model in recognizing physical contacts but also explores the performance in the action recognition downstream task. By doing so, valuable insights into the adaptability of SSL models in interpreting complex, dynamic visual information are contributed.

Submitted 1 August, 2024; originally announced August 2024.

Comments: 15 pages, 6 figures

arXiv:2407.17595

Measurement of the $^8$B Solar Neutrino Flux Using the Full SNO+ Water Phase

Authors: SNO+ Collaboration: A. Allega, M. R. Anderson, S. Andringa, M. Askins, D. J. Auty, A. Bacon, J. Baker, F. Barão, N. Barros, R. Bayes, E. W. Beier, A. Bialek, S. D. Biller, E. Blucher, E. Caden, E. J. Callaghan, M. Chen, S. Cheng, B. Cleveland, D. Cookman, J. Corning, M. A. Cox, R. Dehghani, et al. (93 additional authors not shown)

Abstract: The SNO+ detector operated initially as a water Cherenkov detector. The implementation of a sealed covergas system midway through water data taking resulted in a significant reduction in the activity of $^{222}$Rn daughters in the detector and allowed the lowest background to the solar electron scattering signal above 5 MeV achieved to date. This paper reports an updated SNO+ water phase $^8$B solar neutrino analysis with a total livetime of 282.4 days and an analysis threshold of 3.5 MeV. The $^8$B solar neutrino flux is found to be $\left(2.32^{+0.18}_{-0.17}\text{(stat.)}^{+0.07}_{-0.05}\text{(syst.)}\right)\times10^{6}$ cm$^{-2}$s$^{-1}$ assuming no neutrino oscillations, or $\left(5.36^{+0.41}_{-0.39}\text{(stat.)}^{+0.17}_{-0.16}\text{(syst.)} \right)\times10^{6}$ cm$^{-2}$s$^{-1}$ assuming standard neutrino oscillation parameters, in good agreement with both previous measurements and Standard Solar Model Calculations. The electron recoil spectrum is presented above 3.5 MeV.

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.15390

ALLaM: Large Language Models for Arabic and English

Authors: M Saiful Bari, Yazeed Alnumay, Norah A. Alzahrani, Nouf M. Alotaibi, Hisham A. Alyahya, Sultan AlRashed, Faisal A. Mirza, Shaykhah Z. Alsubaie, Hassan A. Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman Alsubaihi, Maryam Al Mansour, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, Haidar Khan

Abstract: We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT). ALLaM is carefully trained considering the values of language alignment and knowledge transfer at scale. Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining on a mixture of Arabic and English text can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English). Furthermore, we highlight the effectiveness of using parallel/translated data to aid the process of knowledge alignment between languages. Finally, we show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment. ALLaM achieves state-of-the-art performance in various Arabic benchmarks, including MMLU Arabic, ACVA, and Arabic Exams. Our aligned models improve both in Arabic and English from their base aligned models.

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.13813

A review of handcrafted and deep radiomics in neurological diseases: transitioning from oncology to clinical neuroimaging

Authors: Elizaveta Lavrova, Henry C. Woodruff, Hamza Khan, Eric Salmon, Philippe Lambin, Christophe Phillips

Abstract: Medical imaging technologies have undergone extensive development, enabling non-invasive visualization of clinical information. The traditional review of medical images by clinicians remains subjective, time-consuming, and prone to human error. With the recent availability of medical imaging data, quantification has become an important goal in the field. Radiomics, a methodology aimed at extracting quantitative information from imaging data, has emerged as a promising approach to uncover hidden biological information and support decision-making in clinical practice. This paper presents a review of the radiomic pipeline from the clinical neuroimaging perspective, providing a detailed overview of each step with practical advice. It discusses the application of handcrafted and deep radiomics in neuroimaging, stratified by neurological diagnosis. Although radiomics shows great potential for increasing diagnostic precision and improving treatment quality in neurology, several limitations hinder its clinical implementation. Addressing these challenges requires collaborative efforts, advancements in image harmonization methods, and the establishment of reproducible and standardized pipelines with transparent reporting. By overcoming these obstacles, radiomics can significantly impact clinical neurology and enhance patient care.

Submitted 18 July, 2024; originally announced July 2024.

arXiv:2407.13715

Attention Based Simple Primitives for Open World Compositional Zero-Shot Learning

Authors: Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali

Abstract: Compositional Zero-Shot Learning (CZSL) aims to predict unknown compositions made up of attribute and object pairs. Predicting compositions unseen during training is a challenging task. We are exploring Open World Compositional Zero-Shot Learning (OW-CZSL) in this study, where our test space encompasses all potential combinations of attributes and objects. Our approach involves utilizing the self-attention mechanism between attributes and objects to achieve better generalization from seen to unseen compositions. Utilizing a self-attention mechanism facilitates the model's ability to identify relationships between attributes and objects. The similarity between the self-attended textual and visual features is subsequently calculated to generate predictions during the inference phase. The potential test space may encompass implausible object-attribute combinations arising from unrestricted attribute-object pairings. To mitigate this issue, we leverage external knowledge from ConceptNet to restrict the test space to realistic compositions. Our proposed model, Attention-based Simple Primitives (ASP), demonstrates competitive performance, achieving results comparable to the state-of-the-art.

Submitted 18 July, 2024; originally announced July 2024.

Comments: 10 pages, 6 figures

arXiv:2407.11283

Novel Approach for Predicting the Air Quality Index of Megacities through Attention-Enhanced Deep Multitask Spatiotemporal Learning

Authors: Harun Khan, Joseph Tso, Nathan Nguyen, Nivaan Kaushal, Ansh Malhotra, Nayel Rehman

Abstract: Air pollution remains one of the most formidable environmental threats to human health globally, particularly in urban areas, contributing to nearly 7 million premature deaths annually. Megacities, defined as cities with populations exceeding 10 million, are frequent hotspots of severe pollution, experiencing numerous weeks of dangerously poor air quality due to the concentration of harmful pollutants. In addition, the complex interplay of factors makes accurate air quality predictions incredibly challenging, and prediction models often struggle to capture these intricate dynamics. To address these challenges, this paper proposes an attention-enhanced deep multitask spatiotemporal machine learning model based on long short-term memory networks for long-term air quality monitoring and prediction. The model demonstrates robust performance in predicting the levels of major pollutants such as sulfur dioxide and carbon monoxide, effectively capturing complex trends and fluctuations. The proposed model provides actionable information for policymakers, enabling informed decision making to improve urban air quality.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 6 pages, 3 figures, 3 tables

arXiv:2407.06436

Simplifying Integration of Custom Controllers in Exergames

Authors: Hassan Ali Khan, Muhammad Asbar Javed, Amnah Khan

Abstract: Despite the established evidence in favor of exergames for physical rehabilitation, their use is limited in Pakistan. In our user study with game developers (N=62), the majority (67.7%) of the participants believed that exergames' popularity will increase if cheap alternatives to body tracking devices are available. Custom controllers could be used as an affordable alternative input source in exergames, but the lack of hardware programming knowledge and the shortage of experience in embedded programming contribute to the low involvement of game developers (11.3% of the participants) in the area of exergames. This paper presents a library for the integration of Arduino-based (open-source and low-cost) tailored controllers to be used as an input source in Unity3D-based exergames (Unity3D being the most preferred game development engine, chosen by 88.7% of participants). The interface to the library proposes a flexible and easy structure for programming and serves as a template application for a range of exergames.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.04519

Success or Failure? Analyzing Segmentation Refinement with Few-Shot Segmentation

Authors: Seonghyeon Moon, Haein Kong, Muhammad Haris Khan

Abstract: The purpose of segmentation refinement is to enhance the initial coarse masks generated by segmentation algorithms. The refined masks are expected to capture the details and contours of the target objects. Research on segmentation refinement has developed as a response to the need for high-quality initial masks. However, to our knowledge, no method has been developed that can determine the success of segmentation refinement. Such a method could ensure the reliability of segmentation in applications where the outcome of the segmentation is important, and foster innovation in image processing technologies. To address this research gap, we propose JFS (Judging From Support-set), a method to identify the success of segmentation refinement leveraging a few-shot segmentation (FSS) model. The traditional goal of the problem in FSS is to find a target object in a query image utilizing target information given by a support set. However, in our proposed method, we use the FSS network in a novel way to assess the segmentation refinement. When there are two masks, a coarse mask and a refined mask from segmentation refinement, these two masks become support masks. The existing support mask works as a ground truth mask to judge whether the quality of the refined segmentation is more accurate than the coarse mask. We first obtained a coarse mask and refined it using SEPL (SAM Enhanced Pseudo-Labels) to get the two masks. Then, these become input to the FSS model to judge whether the post-processing was successful.
JFS is evaluated on the best and worst cases from SEPL to validate its effectiveness. The results showed that JFS can determine whether the SEPL is a success or not.

Submitted 5 July, 2024; originally announced July 2024.

Comments: 4 pages

arXiv:2407.04069

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Authors: Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

Abstract: Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.01440

GAT-Steiner: Rectilinear Steiner Minimal Tree Prediction Using GNNs

Authors: Bugra Onal, Eren Dogan, Muhammad Hadir Khan, Matthew R. Guthaus

Abstract: The Rectilinear Steiner Minimum Tree (RSMT) problem is a fundamental problem in VLSI placement and routing and is known to be NP-hard. Traditional RSMT algorithms spend a significant amount of time on finding Steiner points to reduce the total wire length or use heuristics to approximate producing sub-optimal results. We show that Graph Neural Networks (GNNs) can be used to predict optimal Steiner points in RSMTs with high accuracy and can be parallelized on GPUs. In this paper, we propose GAT-Steiner, a graph attention network model that correctly predicts 99.846% of the nets in the ISPD19 benchmark with an average increase in wire length of only 0.480% on suboptimal wire length nets. On randomly generated benchmarks, GAT-Steiner correctly predicts 99.942% with an average increase in wire length of only 0.420% on suboptimal wire length nets.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Preprint for The 2024 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2024)
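
To make the role of Steiner points in the GAT-Steiner abstract above concrete, here is a small sketch showing how adding one Steiner point can shorten a rectilinear tree. It is not the paper's GNN model: it enumerates candidate points on the Hanan grid (the intersections of horizontal and vertical lines through the terminals, a standard candidate set for this problem) and measures wire length with a plain Prim spanning tree. The helper names are hypothetical.

```python
from itertools import product

def manhattan(a, b):
    """Rectilinear (L1) distance between two points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mst_length(points):
    """Total length of a minimum spanning tree under L1 distance (Prim)."""
    points = list(points)
    if len(points) < 2:
        return 0
    # best[i] = cheapest known connection from point i to the growing tree
    best = {i: manhattan(points[0], points[i]) for i in range(1, len(points))}
    total = 0
    while best:
        i = min(best, key=best.get)
        total += best.pop(i)
        for j in best:
            best[j] = min(best[j], manhattan(points[i], points[j]))
    return total

def hanan_candidates(terminals):
    """Hanan grid points that are not themselves terminals."""
    xs = {x for x, _ in terminals}
    ys = {y for _, y in terminals}
    return set(product(xs, ys)) - set(terminals)

terminals = [(0, 0), (2, 0), (1, 2)]
without_steiner = mst_length(terminals)
best_point = min(hanan_candidates(terminals),
                 key=lambda p: mst_length(terminals + [p]))
with_steiner = mst_length(terminals + [best_point])
print(without_steiner, best_point, with_steiner)
```

For these three terminals the spanning tree without extra points has length 5, while routing through the Hanan point (1, 0) gives length 4. Exhaustive search over subsets of candidates does not scale, which is the cost that learned Steiner-point prediction is meant to avoid.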

arXiv:2406.17190

Sound Tagging in Infant-centric Home Soundscapes

Authors: Mohammad Nur Hossain Khan, Jialu Li, Nancy L. McElwain, Mark Hasegawa-Johnson, Bashima Islam

Abstract: Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data.
Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted in IEEE/ACM CHASE 2024
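
The sound-tagging abstract above reports results as F1-score and Cohen's Kappa. For readers unfamiliar with the two metrics, the sketch below computes both from scratch for a single-label setting; the toy labels are made up, and the paper's exact averaging scheme across classes is not stated in the abstract, so this is only an illustration of the definitions.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by chance
    from the two label distributions.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    labels = set(y_true) | set(y_pred)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    p_e = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    if p_e == 1.0:
        # Both sequences use one identical label; kappa is formally
        # undefined here, and this sketch returns 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)

y_true = [1, 1, 0, 0]  # made-up reference tags
y_pred = [1, 0, 0, 0]  # made-up model predictions
print(f1_score(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

Kappa corrects raw agreement for chance, which is why it is often reported alongside F1 on imbalanced data such as the sparse infant-centric recordings described above.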

arXiv:2406.14498

LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors

Authors: Sheikh Asif Imran, Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam

Abstract: Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpreting and responding to activity and motion analysis queries. Our evaluation demonstrates LLaSA's effectiveness in activity classification and question answering, highlighting its potential in healthcare, sports science, and human-computer interaction. These contributions advance sensor-aware language models and open new research avenues. Our code repository and datasets can be found on https://github.com/BASHLab/LLaSA.

Submitted 20 June, 2024; originally announced June 2024.

Comments: Under review at ARR (for EMNLP 2024)

arXiv:2406.08775

ALINA: Advanced Line Identification and Notation Algorithm

Authors: Mohammed Abdul Hafeez Khan, Parth Ganeriwala, Siddhartha Bhattacharyya, Natasha Neogi, Raja Muthalagu

Abstract: Labels are the cornerstone of supervised machine learning algorithms. Most visual recognition methods are fully supervised, using bounding boxes or pixel-wise segmentations for object localization. Traditional labeling methods, such as crowd-sourcing, are prohibitive due to cost, data privacy, amount of time, and potential errors on large datasets. To address these issues, we propose a novel annotation framework, Advanced Line Identification and Notation Algorithm (ALINA), which can be used for labeling taxiway datasets that consist of different camera perspectives and variable weather attributes (sunny and cloudy). Additionally, the CIRCular threshoLd pixEl Discovery And Traversal (CIRCLEDAT) algorithm has been proposed, which is an integral step in determining the pixels corresponding to taxiway line markings. Once the pixels are identified, ALINA generates corresponding pixel coordinate annotations on the frame. Using this approach, 60,249 frames from the taxiway dataset, AssistTaxi, have been labeled. To evaluate the performance, a context-based edge map (CBEM) set was generated manually based on edge features and connectivity. The detection rate after testing the annotated labels with the CBEM set was recorded as 98.45%, attesting to its dependability and effectiveness.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Paper has been accepted to The 3rd CVPR Workshop on Vision Datasets Understanding, 2024

arXiv:2406.06533

Pragmatic Formal Verification Methodology for Clock Domain Crossing (CDC)

Authors: Aman Kumar, Muhammad Ul Haque Khan, Bijitendra Mittra

Abstract: Modern System-on-Chip (SoC) designs are becoming more and more complex due to the technology upscaling. SoC designs often operate on multiple asynchronous clock domains, further adding to the complexity of the overall design. To make the devices power efficient, designers take a Globally-Asynchronous Locally-Synchronous (GALS) approach that creates multiple asynchronous domains. These Clock Domain Crossings (CDC) are prone to metastability effects, and functional verification of such CDC is very important to ensure that no bug escapes. Conventional verification methods, such as register transfer level (RTL) simulations and static timing analysis, are not enough to address these CDC issues, which may lead to verification gaps. Additionally, identifying these CDC-related bugs is very time-consuming and is one of the most common reasons for costly silicon re-spins. This paper is focused on the development of a pragmatic formal verification methodology to minimize the CDC issues by exercising Metastability Injection (MSI) in different CDC paths.

Submitted 20 April, 2024; originally announced June 2024.

Comments: Published in DVCon Europe 2023

arXiv:2405.19700 [pdf, other]

Initial measurement of reactor antineutrino oscillation at SNO+

Authors: SNO+ Collaboration: A. Allega, M. R. Anderson, S. Andringa, M. Askins, D. J. Auty, A. Bacon, J. Baker, F. Barão, N. Barros, R. Bayes, E. W. Beier, T. S. Bezerra, A. Bialek, S. D. Biller, E. Blucher, E. Caden, E. J. Callaghan, M. Chen, S. Cheng, B. Cleveland, D. Cookman, J. Corning, M. A. Cox, et al. (96 additional authors not shown)

Abstract: The SNO+ collaboration reports its first spectral analysis of long-baseline reactor antineutrino oscillation using 114 tonne-years of data. Fitting the neutrino oscillation probability to the observed energy spectrum yields constraints on the neutrino mass-squared difference $\Delta m^2_{21}$. In the ranges allowed by previous measurements, the best-fit $\Delta m^2_{21}$ is $(8.85^{+1.10}_{-1.33}) \times 10^{-5}$ eV$^2$. This measurement is continuing in the next phases of SNO+ and is expected to surpass the present global precision on $\Delta m^2_{21}$ with about three years of data.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19292 [pdf, other]

Act Natural! Projecting Autonomous System Trajectories Into Naturalistic Behavior Sets

Authors: Hamzah I. Khan, Adam J. Thorpe, David Fridovich-Keil

Abstract: Autonomous agents operating around human actors must consider how their behaviors might affect those humans, even when not directly interacting with them. To this end, it is often beneficial to be predictable and appear naturalistic. Existing methods to address this problem use human actor intent modeling or imitation learning techniques, but these approaches rarely capture all possible motivations for human behavior or require significant amounts of data. In contrast, we propose a technique for modeling naturalistic behavior as a set of convex hulls computed over a relatively small dataset of human behavior. Given this set, we design an optimization-based filter which projects arbitrary trajectories into it to make them more naturalistic for autonomous agents to execute while also satisfying dynamics constraints. We demonstrate our methods on real-world human driving data from the inD intersection dataset (Bock et al., 2020).

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.14497 [pdf, other]

Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment

Authors: Muhammad Sohail Danish, Muhammad Haris Khan, Muhammad Akhtar Munir, M. Saquib Sarfraz, Mohsen Ali

Abstract: In this work, we tackle the problem of domain generalization for object detection, specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly, we demonstrate that by carefully selecting a set of augmentations, a base detector can outperform existing methods for single domain generalization by a good margin. This highlights the importance of domain diversification in improving the performance of object detectors. Secondly, we introduce a method to align detections from multiple views, considering both classification and localization outputs. This alignment procedure leads to better generalized and well-calibrated object detector models, which are crucial for accurate decision-making in safety-critical applications. Our approach is detector-agnostic and can be seamlessly applied to both single-stage and two-stage detectors. To validate the effectiveness of our proposed methods, we conduct extensive experiments and ablations on challenging domain-shift scenarios. The results consistently demonstrate the superiority of our approach compared to existing methods. Our code and models are available at: https://github.com/msohaildanish/DivAlign

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14323 [pdf, other]

SmartCS: Enabling the Creation of ML-Powered Computer Vision Mobile Apps for Citizen Science Applications without Coding

Authors: Fahim Hasan Khan, Akila de Silva, Gregory Dusek, James Davis, Alex Pang

Abstract: It is undeniable that citizen science contributes to the advancement of various fields of study. There are now software tools that facilitate the development of citizen science apps. However, apps developed with these tools rely on individual human skills to correctly collect useful data. Machine learning (ML)-aided apps provide on-field guidance to citizen scientists on data collection tasks. However, these apps rely on server-side ML support, and therefore need a reliable internet connection. Furthermore, the development of citizen science apps with ML support requires a significant investment of time and money. For some projects, this barrier may preclude the use of citizen science effectively. We present a platform that democratizes citizen science by making it accessible to a much broader audience of both researchers and participants. The SmartCS platform allows one to create citizen science apps with ML support quickly and without coding skills. Apps developed using SmartCS have client-side ML support, making them usable in the field, even when there is no internet connection. The client-side ML helps educate users to better recognize the subjects, thereby enabling high-quality data collection. We present several citizen science apps created using SmartCS, some of which were conceived and created by high school students.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.13518 [pdf, other]

PerSense: Personalized Instance Segmentation in Dense Images

Authors: Muhammad Ibraheem Siddiqui, Muhammad Umer Sheikh, Hassan Abid, Muhammad Haris Khan

Abstract: Leveraging large-scale pre-training, vision foundational models showcase notable performance benefits. While recent years have witnessed significant advancements in segmentation algorithms, existing models still face challenges to automatically segment personalized instances in dense and crowded scenarios. The primary factor behind this limitation stems from bounding box-based detections, which are constrained by occlusions, background clutter, and object orientation, particularly when dealing with dense images. To this end, we propose PerSense, an end-to-end, training-free, and model-agnostic one-shot framework to address the personalized instance segmentation in dense images. Towards developing this framework, we make the following core contributions. (a) We propose an Instance Detection Module (IDM) and leverage a Vision-Language Model, a grounding object detector, and a few-shot object counter (FSOC) to realize a new baseline. (b) To tackle false positives within candidate point prompts, we design a Point Prompt Selection Module (PPSM). Both IDM and PPSM transform density maps from FSOC into personalized instance-level point prompts for segmentation and offer a seamless integration in our model-agnostic framework. (c) We introduce a feedback mechanism which enables PerSense to harness the full potential of FSOC by automating the exemplar selection process. (d) To promote algorithmic advances and effective tools for this relatively underexplored task, we introduce PerSense-D, a dataset exclusive to personalized instance segmentation in dense images.
We validate the effectiveness of PerSense on the task of personalized instance segmentation in dense images on PerSense-D, in comparison with SOTA. Additionally, our qualitative findings demonstrate the adaptability of our framework to images captured in-the-wild.

Submitted 22 May, 2024; originally announced May 2024.

Comments: Technical report of PerSense

arXiv:2405.12986 [pdf]

A Novel Feature Map Enhancement Technique Integrating Residual CNN and Transformer for Alzheimer Diseases Diagnosis

Authors: Saddam Hussain Khan

Abstract: Alzheimer diseases (ADs) involve cognitive decline and abnormal brain protein accumulation, necessitating timely diagnosis for effective treatment. Therefore, CAD systems leveraging deep learning advancements have demonstrated success in AD detection but are challenged by computational intricacies and by minor contrast, structural, and texture variations in the dataset. In this regard, a novel hybrid FME-Residual-HSCMT technique is introduced, comprising residual CNN and Transformer concepts to capture global and local fine-grained AD analysis in MRI. This approach integrates three distinct elements: a novel CNN Meet Transformer (HSCMT), customized residual learning CNN, and a new Feature Map Enhancement (FME) strategy to learn diverse morphological, contrast, and texture variations of ADs. The proposed HSCMT at the initial stage utilizes stem convolution blocks that are integrated with CMT blocks followed by systematic homogenous and structural (HS) operations. The customized CMT block encapsulates each element with global contextual interactions through multi-head attention and facilitates computational efficiency through a lightweight design. Moreover, inverse residual and stem CNN in customized CMT enable effective extraction of local texture information and handling of vanishing gradients. Furthermore, in the FME strategy, residual CNN blocks utilize TL-based generated auxiliary channels and are combined with the proposed HSCMT channels at the target level to achieve a diverse, enriched feature space.
Finally, diverse enhanced channels are fed into a novel spatial attention mechanism for optimal pixel selection to reduce redundancy and discriminate minor contrast and texture inter-class variation. The proposed technique achieves an F1-score of 98.55%, an accuracy of 98.42%, a sensitivity of 98.50%, and a precision of 98.60% on the standard Kaggle dataset, and outperforms existing ViT and CNN methods.

Submitted 25 May, 2024; v1 submitted 30 March, 2024; originally announced May 2024.

Comments: 28 Pages, 11 Figures, 3 Tables

arXiv:2405.11829 [pdf, other]

Adversarially Diversified Rehearsal Memory (ADRM): Mitigating Memory Overfitting Challenge in Continual Learning

Authors: Hikmat Khan, Ghulam Rasool, Nidhal Carla Bouaynaya

Abstract: Continual learning focuses on learning non-stationary data distribution without forgetting previous knowledge. Rehearsal-based approaches are commonly used to combat catastrophic forgetting. However, these approaches suffer from a problem called "rehearsal memory overfitting," where the model becomes too specialized on limited memory samples and loses its ability to generalize effectively. As a result, the effectiveness of the rehearsal memory progressively decays, ultimately resulting in catastrophic forgetting of the learned tasks. We introduce the Adversarially Diversified Rehearsal Memory (ADRM) to address the memory overfitting challenge. This novel method is designed to enrich memory sample diversity and bolster resistance against natural and adversarial noise disruptions. ADRM employs FGSM attacks to introduce adversarially modified memory samples, achieving two primary objectives: enhancing memory diversity and fostering a robust response to continual feature drifts in memory samples. Our contributions are as follows: Firstly, ADRM addresses overfitting in rehearsal memory by employing FGSM to diversify and increase the complexity of the memory buffer. Secondly, we demonstrate that ADRM mitigates memory overfitting and significantly improves the robustness of CL models, which is crucial for safety-critical applications. Finally, our detailed analysis of features and visualization demonstrates that ADRM mitigates feature drifts in CL memory samples, significantly reducing catastrophic forgetting and resulting in a more resilient CL model.
Additionally, our in-depth t-SNE visualizations of feature distribution and the quantification of the feature similarity further enrich our understanding of feature representation in existing CL approaches. Our code is publicly available at https://github.com/hikmatkhan/ADRM.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.07698 [pdf, other]

oTTC: Object Time-to-Contact for Motion Estimation in Autonomous Driving

Authors: Abdul Hannan Khan, Syed Tahseen Raza Rizvi, Dheeraj Varma Chittari Macharavtu, Andreas Dengel

Abstract: Autonomous driving systems require a quick and robust perception of the nearby environment to carry out their routines effectively. With the aim to avoid collisions and drive safely, autonomous driving systems rely heavily on object detection. However, 2D object detections alone are insufficient; more information, such as relative velocity and distance, is required for safer planning. Monocular 3D object detectors try to solve this problem by directly predicting 3D bounding boxes and object velocities given a camera image. Recent research estimates time-to-contact in a per-pixel manner and suggests that it is a more effective measure than velocity and depth combined. However, per-pixel time-to-contact requires object detection to serve its purpose effectively and hence increases overall computational requirements as two different models need to run. To address this issue, we propose per-object time-to-contact estimation by extending object detection models to additionally predict the time-to-contact attribute for each object. We compare our proposed approach with existing time-to-contact methods and provide benchmarking results on well-known datasets. Our proposed approach achieves higher precision compared to prior art while using a single image.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 9 pages, 4 figures

arXiv:2405.06919 [pdf, other]

Automating Thematic Analysis: How LLMs Analyse Controversial Topics

Authors: Awais Hameed Khan, Hiruni Kegalle, Rhea D'Silva, Ned Watt, Daniel Whelan-Shamy, Lida Ghahremanlou, Liam Magee

Abstract: Large Language Models (LLMs) are promising analytical tools. They can augment human epistemic, cognitive and reasoning abilities, and support 'sensemaking', making sense of a complex environment or subject by analysing large volumes of data with a sensitivity to context and nuance absent in earlier text processing systems. This paper presents a pilot experiment that explores how LLMs can support thematic analysis of controversial topics. We compare how human researchers and two LLMs, GPT-4 and Llama 2, categorise excerpts from media coverage of the controversial Australian Robodebt scandal. Our findings highlight intriguing overlaps and variances in thematic categorisation between human and machine agents, and suggest where LLMs can be effective in supporting forms of discourse and thematic analysis. We argue LLMs should be used to augment, and not replace, human interpretation, and we add further methodological insights and reflections to existing research on the application of automation to qualitative research methods. We also introduce a novel card-based design toolkit for both researchers and practitioners to further interrogate LLMs as analytical tools.

Submitted 11 May, 2024; originally announced May 2024.

Comments: 18 pages, 6 figures

ACM Class: K.4.2

arXiv:2404.14588 [pdf]

Brain-Inspired Continual Learning-Robust Feature Distillation and Re-Consolidation for Class Incremental Learning

Authors: Hikmat Khan, Nidhal Carla Bouaynaya, Ghulam Rasool

Abstract: Artificial intelligence (AI) and neuroscience share a rich history, with advancements in neuroscience shaping the development of AI systems capable of human-like knowledge retention. Leveraging insights from neuroscience and existing research in adversarial and continual learning, we introduce a novel framework comprising two core concepts: feature distillation and re-consolidation. Our framework, named Robust Rehearsal, addresses the challenge of catastrophic forgetting inherent in continual learning (CL) systems by distilling and rehearsing robust features. Inspired by the mammalian brain's memory consolidation process, Robust Rehearsal aims to emulate the rehearsal of distilled experiences during learning tasks. Additionally, it mimics memory re-consolidation, where new experiences influence the integration of past experiences to mitigate forgetting. Extensive experiments conducted on CIFAR10, CIFAR100, and real-world helicopter attitude datasets showcase the superior performance of CL models trained with Robust Rehearsal compared to baseline methods. Furthermore, examining different optimization training objectives (joint, continual, and adversarial learning), we highlight the crucial role of feature learning in model performance. This underscores the significance of rehearsing CL-robust samples in mitigating catastrophic forgetting. In conclusion, aligning CL approaches with neuroscience insights offers promising solutions to the challenge of catastrophic forgetting, paving the way for more robust and human-like AI systems.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.09790 [pdf, other]

NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results

Authors: Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, et al. (63 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.

Submitted 15 April, 2024; originally announced April 2024.

Comments: NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

arXiv:2404.09342 [pdf, other]

Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

Abstract: Advancements in technology have led to the use of multimodal systems in various real-world applications. Among them, audio-visual systems are among the most widely used. In recent years, associating the face and voice of a person has gained attention due to the presence of a unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people most often communicate in multilingual scenarios. The challenge uses a dataset, namely Multilingual Audio-Visual (MAV-Celeb), for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines and task details for the FAME Challenge.

Submitted 22 July, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: ACM Multimedia Conference - Grand Challenge

arXiv:2404.01352 [pdf, other]

VortexViz: Finding Vortex Boundaries by Learning from Particle Trajectories

Authors: Akila de Silva, Nicholas Tee, Omkar Ghanekar, Fahim Hasan Khan, Gregory Dusek, James Davis, Alex Pang

Abstract: Vortices are studied in various scientific disciplines, offering insights into fluid flow behavior. Visualizing the boundary of vortices is crucial for understanding flow phenomena and detecting flow irregularities. This paper addresses the challenge of accurately extracting vortex boundaries using deep learning techniques. While existing methods primarily train on velocity components, we propose a novel approach incorporating particle trajectories (streamlines or pathlines) into the learning process. By leveraging the regional/local characteristics of the flow field captured by streamlines or pathlines, our methodology aims to enhance the accuracy of vortex boundary extraction.

Submitted 1 April, 2024; originally announced April 2024.

Comments: Under review

arXiv:2403.16194 [pdf, other]

Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery

Authors: Siddharth Tourani, Ahmed Alwheibi, Arif Mahmood, Muhammad Haris Khan

Abstract: Unsupervised landmark discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework, we explore the potential of a recent paradigm of self-supervised learning algorithms, known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. Towards harnessing the potential of diffusion models for the ULD task, we make the following core contributions. First, we propose a ZeroShot ULD baseline based on simple clustering of random pixel locations with nearest neighbour matching. It delivers better results than existing ULD methods. Second, motivated by the ZeroShot performance, we develop a ULD algorithm based on diffusion features using self-training and clustering which also outperforms prior methods by notable margins. Third, we introduce a new proxy task based on generating latent pose codes and also propose a two-stage clustering mechanism to facilitate effective pseudo-labeling, resulting in a significant performance improvement. Overall, our approach consistently outperforms state-of-the-art methods on four challenging benchmarks (AFLW, MAFL, CatHeads and LS3D) by significant margins.

Submitted 24 March, 2024; originally announced March 2024.

Comments: Accepted in CVPR 2024

arXiv:2403.11674 [pdf, other]

Towards Generalizing to Unseen Domains with Few Labels

Authors: Chamuditha Jayanga Galappaththige, Sanoojan Baliah, Malitha Gunawardhana, Muhammad Haris Khan

Abstract: We approach the challenge of addressing semi-supervised domain generalization (SSDG). Specifically, our aim is to obtain a model that learns domain-generalizable features by leveraging a limited subset of labelled data alongside a substantially larger pool of unlabeled data. Existing domain generalization (DG) methods, which are unable to exploit unlabeled data, perform poorly compared to semi-supervised learning (SSL) methods under the SSDG setting. Nevertheless, SSL methods have considerable room for performance improvement when compared to fully-supervised DG training. To tackle this underexplored, yet highly practical problem of SSDG, we make the following core contributions. First, we propose a feature-based conformity technique that matches the posterior distributions from the feature space with the pseudo-label from the model's output space. Second, we develop a semantics alignment loss to learn semantically-compatible representations by regularizing the semantic structure in the feature space. Our method is plug-and-play and can be readily integrated with different SSL-based SSDG baselines without introducing any additional parameters. Extensive experimental results across five challenging DG benchmarks with four strong SSL baselines suggest that our method provides consistent and notable gains in two different SSDG settings.

Submitted 7 May, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: Accepted at CVPR 2024

arXiv:2403.07019 [pdf]

Reasons behind the Water Crisis and its Potential Health Outcomes

Authors: Md. Galib Ishraq Emran, Rhidi Barma, Akram Hussain Khan, Mrinmoy Roy

Abstract: Globally, the water crisis has become a significant problem that affects developing and industrialized nations. Water shortage can harm public health by increasing the chance of contracting water-borne diseases, dehydration, and malnutrition. This study aims to examine the causes of the water problem and its likely effects on human health. The study scrutinizes the reasons behind the water crisis, including population increase, climate change, and inefficient water management techniques. The effects of a lack of water on human health, such as the spread of infectious diseases, a higher risk of starvation and dehydration, and psychological stress, are also covered in the study. The research further suggests several ways to deal with the water situation and lessen its potential outcomes on human health. These remedies include enhanced sanitation and hygiene procedures, water management, and conservation techniques like rainwater gathering and wastewater recycling.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.02782 [pdf, other]

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

Authors: Kumaranage Ravindu Yasas Nagasinghe, Honglu Zhou, Malitha Gunawardhana, Martin Renqiang Min, Daniel Harari, Muhammad Haris Khan

Abstract: In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedural names, or natural language step-by-step instructions, for features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the capabilities of the agent by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior, state-of-the-art results while requiring only minimal supervision.

Submitted 15 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: 8 pages, 6 figures, (supplementary material: 9 pages, 5 figures), accepted to CVPR 2024

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024, Pages 18816-18826
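
As a rough illustration of the "directed weighted graph" of procedural knowledge mentioned in the KEPP abstract above, the sketch below counts step-to-step transitions in a few made-up training plans and normalises them into transition probabilities. The plans, step names, and function names are hypothetical and are not taken from the paper or its code; the greedy walk is only one simple way such a graph might be queried.

```python
from collections import defaultdict


def build_transition_graph(plans):
    """Count consecutive step pairs and normalise the counts per source step."""
    counts = defaultdict(lambda: defaultdict(int))
    for plan in plans:
        for src, dst in zip(plan, plan[1:]):
            counts[src][dst] += 1
    graph = {}
    for src, nexts in counts.items():
        total = sum(nexts.values())
        graph[src] = {dst: n / total for dst, n in nexts.items()}
    return graph


def most_likely_path(graph, start, goal, max_steps):
    """Greedy walk: follow the most probable edge until the goal or step limit."""
    path = [start]
    while path[-1] != goal and len(path) < max_steps:
        nexts = graph.get(path[-1])
        if not nexts:
            break  # dead end: no outgoing edges were seen in training
        path.append(max(nexts, key=nexts.get))
    return path


# Hypothetical training plans (illustrative only).
plans = [
    ["crack egg", "whisk", "pour", "flip"],
    ["crack egg", "whisk", "add milk", "pour", "flip"],
    ["crack egg", "whisk", "pour", "flip"],
]
graph = build_transition_graph(plans)
path = most_likely_path(graph, "crack egg", "flip", max_steps=6)
print(graph["whisk"])
print(path)
```

Here the edge weights out of each step sum to one, so the graph can be read as a first-order transition model over steps.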

arXiv:2402.09244 [pdf, other]

Zero-energy Devices for 6G: Technical Enablers at a Glance

Authors: Onel López, Ritesh Kumar Singh, Dinh-Thuy Phan-Huy, Efstathios Katranaras, Nafiseh Mazloum, Riku Jäntti, Hamza Khan, Osmel Rosabal, Pavlos Alexias, Prasoon Raghuwanshi, David Ruiz-Guirola, Bikramjit Singh, Andreas Höglund, Dung Pham Van, Amirhossein Azarbahram, Jeroen Famaey

Abstract: Low-cost, resource-constrained, maintenance-free, and energy-harvesting (EH) Internet of Things (IoT) devices, referred to as zero-energy devices (ZEDs), are rapidly attracting attention from industry and academia due to their myriad of applications. To date, such devices remain primarily unsupported by modern IoT connectivity solutions due to their intrinsic fabrication, hardware, deployment, and operation limitations, while lacking clarity on their key technical enablers and prospects. Herein, we address this by discussing the main characteristics and enabling technologies of ZEDs within the next generation of mobile networks, specifically focusing on unconventional EH sources, multi-source EH, power management, energy storage solutions, manufacturing material and practices, backscattering, and low-complexity receivers. Moreover, we highlight the need for lightweight and energy-aware computing, communication, and scheduling protocols, while discussing potential approaches related to TinyML, duty cycling, and infrastructure enablers like radio frequency wireless power transfer and wake-up protocols. Challenging aspects and open research directions are identified and discussed in all the cases. Finally, we showcase an experimental ZED proof-of-concept related to ambient cellular backscattering.

Submitted 14 February, 2024; originally announced February 2024.

Comments: 8 pages, 4 Figures

arXiv:2402.01781 [pdf, other]

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Authors: Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan

Abstract: Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly) mistake. Under existing leaderboards, the relative performance of LLMs is highly sensitive to (often minute) details. We show that for popular multiple-choice question benchmarks (e.g., MMLU), minor perturbations to the benchmark, such as changing the order of choices or the method of answer selection, result in changes in rankings up to 8 positions. We explain this phenomenon by conducting systematic experiments over three broad categories of benchmark perturbations and identifying the sources of this behavior. Our analysis results in several best-practice recommendations, including the advantage of a hybrid scoring method for answer selection. Our study highlights the dangers of relying on simple benchmark evaluations and charts the path for more robust evaluation schemes on the existing benchmarks. The code for this paper is available at https://github.com/National-Center-for-AI-Saudi-Arabia/lm-evaluation-harness.

Submitted 3 July, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: updated with ACL 2024 camera ready version
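
To make the notion of "changes in rankings" in the leaderboard abstract above concrete, the following sketch ranks a few models under two evaluation settings and reports how far each model moves. The model names and scores are invented for illustration, and this is only one simple way to quantify rank shifts; it is not the paper's evaluation code.

```python
def ranks(scores):
    """Map each model to its 1-based rank, with the highest score ranked first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_shifts(baseline, perturbed):
    """Absolute change in rank for every model between two score tables."""
    base_ranks = ranks(baseline)
    new_ranks = ranks(perturbed)
    return {m: abs(base_ranks[m] - new_ranks[m]) for m in baseline}


# Hypothetical accuracies before and after shuffling the answer-choice order.
baseline = {"model_a": 0.70, "model_b": 0.68, "model_c": 0.65, "model_d": 0.60}
perturbed = {"model_a": 0.66, "model_b": 0.69, "model_c": 0.67, "model_d": 0.61}

shifts = rank_shifts(baseline, perturbed)
print(shifts, "largest shift:", max(shifts.values()))
```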

arXiv:2402.00128 [pdf, other]

Real-time Traffic Object Detection for Autonomous Driving

Authors: Abdul Hannan Khan, Syed Tahseen Raza Rizvi, Andreas Dengel

Abstract: With recent advances in computer vision, it appears that autonomous driving will be part of modern society sooner rather than later. However, there are still a significant number of concerns to address. Although modern computer vision techniques demonstrate superior performance, they tend to prioritize accuracy over efficiency, which is a crucial aspect of real-time applications. Large object detection models typically require higher computational power, which is achieved by using more sophisticated onboard hardware. For autonomous driving, these requirements translate to increased fuel costs and, ultimately, a reduction in mileage. Further, despite their computational demands, the existing object detectors are far from being real-time. In this research, we assess the robustness of our previously proposed, highly efficient pedestrian detector LSFM on well-established autonomous driving benchmarks, including diverse weather conditions and nighttime scenes. Moreover, we extend our LSFM model for general object detection to achieve real-time object detection in traffic scenes. We evaluate its performance, low latency, and generalizability on traffic object detection datasets. Furthermore, we discuss the inadequacy of the current key performance indicator employed by object detection systems in the context of autonomous driving and propose a more suitable alternative that incorporates real-time requirements.

Submitted 29 February, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

Comments: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2401.13965 [pdf, other]

Improving Pseudo-labelling and Enhancing Robustness for Semi-Supervised Domain Generalization

Authors: Adnan Khan, Mai A. Shaaban, Muhammad Haris Khan

Abstract: Beyond attaining domain generalization (DG), visual recognition models should also be data-efficient during learning by leveraging limited labels. We study the problem of Semi-Supervised Domain Generalization (SSDG) which is crucial for real-world applications like automated healthcare. SSDG requires learning a cross-domain generalizable model when the given training data is only partially labelled. Empirical investigations reveal that the DG methods tend to underperform in SSDG settings, likely because they are unable to exploit the unlabelled data. Semi-supervised learning (SSL) shows improved but still inferior results compared to fully-supervised learning. A key challenge, faced by the best-performing SSL-based SSDG methods, is selecting accurate pseudo-labels under multiple domain shifts and reducing overfitting to source domains under limited labels. In this work, we propose a new SSDG approach, which utilizes a novel uncertainty-guided pseudo-labelling with model averaging (UPLM). Our uncertainty-guided pseudo-labelling (UPL) uses model uncertainty to improve pseudo-labelling selection, addressing poor model calibration under multi-source unlabelled data. The UPL technique, enhanced by our novel model averaging (MA) strategy, mitigates overfitting to source domains with limited labels. Extensive experiments on key representative DG datasets suggest that our method demonstrates effectiveness against existing methods. Our code and chosen labelled data seeds are available on GitHub: https://github.com/Adnan-Khan7/UPLM

Submitted 25 January, 2024; originally announced January 2024.
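
The UPLM abstract above says pseudo-labels are selected using model uncertainty but does not spell out the estimator. A minimal sketch of the general idea, assuming several stochastic forward passes per unlabelled sample (for example with dropout left on) and two hypothetical thresholds, might look like the following. The function name, the thresholds, and the use of the standard deviation as the uncertainty measure are all assumptions for illustration and may differ from the authors' implementation.

```python
from statistics import mean, pstdev


def select_pseudo_label(prob_samples, conf_threshold=0.9, unc_threshold=0.05):
    """Return a class index if the prediction is confident AND stable, else None.

    prob_samples: list of class-probability vectors, one per stochastic pass.
    """
    num_classes = len(prob_samples[0])
    mean_probs = [mean([p[c] for p in prob_samples]) for c in range(num_classes)]
    label = max(range(num_classes), key=lambda c: mean_probs[c])
    # Spread of the winning class's probability across passes (assumed measure).
    uncertainty = pstdev([p[label] for p in prob_samples])
    if mean_probs[label] >= conf_threshold and uncertainty <= unc_threshold:
        return label
    return None


# Hypothetical outputs of three stochastic passes for two unlabelled samples.
stable = [[0.95, 0.05], [0.96, 0.04], [0.94, 0.06]]
unstable = [[0.99, 0.01], [0.80, 0.20], [0.97, 0.03]]

print(select_pseudo_label(stable))
print(select_pseudo_label(unstable))
```

A confidence-only rule would accept both samples, since both have a mean top-class probability above 0.9; the extra uncertainty check rejects the second one because its predictions fluctuate between passes.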

arXiv:2401.13785 [pdf, other]

Unified Spatio-Temporal Tri-Perspective View Representation for 3D Semantic Occupancy Prediction

Authors: Sathira Silva, Savindu Bhashitha Wannigama, Gihan Jayatilaka, Muhammad Haris Khan, Roshan Ragel

Abstract: Holistic understanding and reasoning in 3D scenes play a vital role in the success of autonomous driving systems. The evolution of 3D semantic occupancy prediction as a pretraining task for autonomous driving and robotic downstream tasks captures finer 3D details compared to methods like 3D detection. Existing approaches predominantly focus on spatial cues such as tri-perspective view embeddings (TPV), often overlooking temporal cues. This study introduces a spatiotemporal transformer architecture S2TPVFormer for temporally coherent 3D semantic occupancy prediction. We enrich the prior process by including temporal cues using a novel temporal cross-view hybrid attention mechanism (TCVHA) and generate spatiotemporal TPV embeddings (i.e. S2TPV embeddings). Experimental evaluations on the nuScenes dataset demonstrate a substantial 4.1% improvement in mean Intersection over Union (mIoU) for 3D Semantic Occupancy compared to TPVFormer, confirming the effectiveness of the proposed S2TPVFormer in enhancing 3D scene perception.

Submitted 4 April, 2024; v1 submitted 24 January, 2024; originally announced January 2024.
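
For readers unfamiliar with the mIoU metric quoted in the S2TPVFormer abstract above, a minimal sketch of the usual definition follows: per-class intersection over union computed on flattened label arrays, averaged over the classes that occur. The toy labels are made up, and benchmark implementations (including the nuScenes tooling) may differ in details such as ignored classes.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Toy flattened voxel labels for three classes (illustrative only).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.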

arXiv:2401.11621 [pdf]

A Novel Decision Ensemble Framework: Customized Attention-BiLSTM and XGBoost for Speculative Stock Price Forecasting

Authors: Riaz Ud Din, Salman Ahmed, Saddam Hussain Khan

Abstract: Forecasting speculative stock prices is essential for effective investment risk management that drives the need for the development of innovative algorithms. However, the speculative nature, volatility, and complex sequential dependencies within financial markets present inherent challenges which necessitate advanced techniques. This paper proposes a novel framework, CAB-XDE (customized attention BiLSTM-XGB decision ensemble), for predicting the daily closing price of speculative stock Bitcoin-USD (BTC-USD). The CAB-XDE framework integrates a customized bi-directional long short-term memory (BiLSTM) with the attention mechanism and the XGBoost algorithm. The customized BiLSTM leverages its learning capabilities to capture the complex sequential dependencies and speculative market trends. Additionally, the new attention mechanism dynamically assigns weights to influential features, thereby enhancing interpretability, and optimizing effective cost measures and volatility forecasting. Moreover, XGBoost handles nonlinear relationships and contributes to the robustness of the proposed CAB-XDE framework. Additionally, the weight determination theory-error reciprocal method further refines predictions. This refinement is achieved by iteratively adjusting model weights. It is based on discrepancies between theoretical expectations and actual errors in individual customized attention BiLSTM and XGBoost models to enhance performance. Finally, the predictions from both XGBoost and customized attention BiLSTM models are concatenated to achieve diverse prediction space and are provided to the ensemble classifier to enhance the generalization capabilities of CAB-XDE. The proposed CAB-XDE framework is empirically validated on the volatile Bitcoin market, sourced from Yahoo Finance, and outperforms state-of-the-art models with a MAPE of 0.0037, MAE of 84.40, and RMSE of 106.14.

Submitted 5 January, 2024; originally announced January 2024.

Comments: 30 pages, 16 Figures, 4 Tables
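
The CAB-XDE abstract above reports MAPE, MAE, and RMSE and mentions an error reciprocal weighting step. The sketch below shows the standard definitions of the three metrics together with the common two-model form of error-reciprocal weighting, in which the model with the smaller error receives the larger weight. The prices and predictions are invented, and the paper's iterative variant may differ from this simple closed form.

```python
import math


def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def mape(y, yhat):
    """Mean absolute percentage error, returned as a fraction (not percent)."""
    return sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)


def error_reciprocal_weights(err_a, err_b):
    """Two-model error-reciprocal weights: lower error gets the higher weight."""
    total = err_a + err_b
    return err_b / total, err_a / total


# Hypothetical closing prices and two models' predictions.
y = [100.0, 200.0, 400.0]
model_a = [110.0, 190.0, 420.0]
model_b = [130.0, 170.0, 360.0]

w_a, w_b = error_reciprocal_weights(mae(y, model_a), mae(y, model_b))
combined = [w_a * a + w_b * b for a, b in zip(model_a, model_b)]
print(w_a, w_b, mae(y, combined))
```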

arXiv:2401.11358 [pdf, other]

ANNA: A Deep Learning Based Dataset in Heterogeneous Traffic for Autonomous Vehicles

Authors: Mahedi Kamal, Tasnim Fariha, Afrina Kabir Zinia, Md. Abu Syed, Fahim Hasan Khan, Md. Mahbubur Rahman

Abstract: Recent breakthroughs in artificial intelligence offer tremendous promise for the development of self-driving applications. Deep Neural Networks, in particular, are being utilized to support the operation of semi-autonomous cars through object identification and semantic segmentation. To assess the inadequacy of the current dataset in the context of autonomous and semi-autonomous cars, we created a new dataset named ANNA. This study discusses a custom-built dataset that includes some unidentified vehicles in the perspective of Bangladesh, which are not included in the existing dataset. A dataset validity check was performed by evaluating models using the Intersection Over Union (IOU) metric. The results demonstrated that the model trained on our custom dataset was more precise and efficient than the models trained on the KITTI or COCO dataset concerning Bangladeshi traffic. The research presented in this paper also emphasizes the importance of developing accurate and efficient object detection algorithms for the advancement of autonomous vehicles.

Submitted 20 January, 2024; originally announced January 2024.
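
The ANNA abstract above validates its dataset with the Intersection over Union (IoU) metric. For object detection this is usually computed between a predicted and a ground-truth bounding box; a minimal sketch, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates, is shown below. The box format and example values are assumptions for illustration.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```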

arXiv:2401.09354 [pdf]

Transcending Controlled Environments: Assessing the Transferability of ASR-Robust NLU Models to Real-World Applications

Authors: Hania Khan, Aleena Fatima Khalid, Zaryab Hassan

Abstract: This research investigates the transferability of Automatic Speech Recognition (ASR)-robust Natural Language Understanding (NLU) models from controlled experimental conditions to practical, real-world applications. Focused on smart home automation commands in Urdu, the study assesses model performance under diverse noise profiles, linguistic variations, and ASR error scenarios. Leveraging the UrduBERT model, the research employs a systematic methodology involving real-world data collection, cross-validation, transfer learning, noise variation studies, and domain adaptation. Evaluation metrics encompass task-specific accuracy, latency, user satisfaction, and robustness to ASR errors. The findings contribute insights into the challenges and adaptability of ASR-robust NLU models in transcending controlled environments.

Submitted 12 January, 2024; originally announced January 2024.

arXiv:2401.06084 [pdf, other]

Post-Newtonian effects in compact binaries with a dark matter spike: A Lagrangian approach

Authors: Diego Montalvo, Adam Smith-Orlik, Saeed Rastgoo, Laura Sagunski, Niklas Becker, Hazkeel Khan

Abstract: We present a simple but powerful Lagrangian method that can be used to study the post-Newtonian evolution of a compact binary system with environment, including a dark matter spike, around it, and obtain the resulting gravitational wave emission. This formalism allows one to incorporate post-Newtonian effects up to any desired known order, as well as any other environmental effect around the binary, as long as their dissipation power or force formulae are known. In particular, in this work, we employ this method to study a black hole-black hole binary system of mass ratio $10^5$, by including post-Newtonian effects of order 1PN and 2.5PN as well as the effect of relativistic dynamical friction. We obtain the modified orbits and the corresponding modified gravitational waveform. Finally, we contrast these modifications against the LISA sensitivity curve in frequency space and show that this observatory can detect the associated signals.

Submitted 11 January, 2024; originally announced January 2024.

Comments: 16 pages, 4 figures

arXiv:2312.04695 [pdf]

Foreign Capital and Economic Growth: Evidence from Bangladesh

Authors: Ummya Salma, Md. Fazlul Huq Khan, Md. Masum Billah

Abstract: This study aims to examine the relationship between Foreign Direct Investment (FDI), personal remittances received, and official development assistance (ODA) in the economic growth of Bangladesh. The study utilizes time series data on Bangladesh from 1976 to 2021. Additionally, this research contributes to the existing literature by introducing the Foreign Capital Depthless Index (FCDI) and exploring its impact on Bangladesh's economic growth. The results of the Vector Error Correction Model (VECM) suggest that the economic growth of Bangladesh depends on FDI, remittances, and aid in the long run. However, these variables do not exhibit a causal relationship with GDP in the short run. The relationship between FCDI and economic growth is positive in the long run. Nevertheless, the presence of these three variables has a more significant impact on the economic growth of Bangladesh.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2312.00634 [pdf]

A Recent Survey of Vision Transformers for Medical Image Segmentation

Authors: Asifullah Khan, Zunaira Rauf, Abdul Rehman Khan, Saima Rathore, Saddam Hussain Khan, Najmus Saher Shah, Umair Farooq, Hifsa Asif, Aqsa Asif, Umme Zahoora, Rafi Ullah Khalil, Suleman Qamar, Umme Hani Asif, Faiza Babar Khan, Abdul Majid, Jeonghwan Gwak

Abstract: Medical image segmentation plays a crucial role in various healthcare applications, enabling accurate diagnosis, treatment planning, and disease monitoring. Traditionally, convolutional neural networks (CNNs) dominated this domain, excelling at local feature extraction. However, their limitations in capturing long-range dependencies across image regions pose challenges for segmenting complex, interconnected structures often encountered in medical data. In recent years, Vision Transformers (ViTs) have emerged as a promising technique for addressing the challenges in medical image segmentation. Their multi-scale attention mechanism enables effective modeling of long-range dependencies between distant structures, crucial for segmenting organs or lesions spanning the image. Additionally, ViTs' ability to discern subtle pattern heterogeneity allows for the precise delineation of intricate boundaries and edges, a critical aspect of accurate medical image segmentation. However, they do lack image-related inductive bias and translational invariance, potentially impacting their performance. Recently, researchers have come up with various ViT-based approaches that incorporate CNNs in their architectures, known as Hybrid Vision Transformers (HVTs) to capture local correlation in addition to the global information in the images. This survey paper provides a detailed review of the recent advancements in ViTs and HVTs for medical image segmentation. Along with the categorization of ViT and HVT-based medical image segmentation approaches, we also present a detailed overview of their real-time applications in several medical image modalities. This survey may serve as a valuable resource for researchers, healthcare practitioners, and students in understanding the state-of-the-art approaches for ViT-based medical image segmentation.

Submitted 18 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.10754 [pdf]

A Recent Survey of the Advancements in Deep Learning Techniques for Monkeypox Disease Detection

Authors: Saddam Hussain Khan, Rashid Iqbal, Saeeda Naz

Abstract: Monkeypox (MPox) is a zoonotic infectious disease induced by the MPox Virus, part of the poxviridae orthopoxvirus group initially discovered in Africa and gained global attention in mid-2022 with cases reported outside endemic areas. Symptoms include headaches, chills, fever, smallpox, measles, and chickenpox-like skin manifestations and the WHO officially announced MPox as a global public health pandemic, in July 2022. Traditionally, PCR testing of skin lesions is considered a benchmark for the primary diagnosis by WHO, with symptom management as the primary treatment and antiviral drugs like tecovirimat for severe cases. However, manual analysis within hospitals poses a substantial challenge including the substantial burden on healthcare professionals, limited facilities, availability and fatigue among doctors, and human error during public health emergencies. Therefore, this survey paper provides an extensive and efficient analysis of deep learning (DL) methods for the automatic detection of MPox in skin lesion images. These DL techniques are broadly grouped into categories, including deep CNN, Deep CNNs ensemble, deep hybrid learning, the newly developed, and Vision transformer for diagnosing MPox. Moreover, this study offers a systematic exploration of the evolutionary progression of DL techniques and identifies and addresses limitations in previous methods while highlighting the valuable contributions and innovation. Additionally, the paper addresses benchmark datasets and their collection from various authentic sources, pre-processing techniques, and evaluation metrics. The survey also briefly delves into emerging concepts, identifies research gaps, limitations, and applications, and outlines challenges in the diagnosis process. This survey furnishes valuable insights into the prospective areas of DL innovative ideas and is anticipated to serve as a path for researchers.

Submitted 23 November, 2023; v1 submitted 6 November, 2023; originally announced November 2023.

Comments: 53 pages, 16 figures, 7 tables

arXiv:2311.09246 [pdf, other]

doi 10.1109/ISMAR59233.2023.00097

Smell of Fire Increases Behavioural Realism in Virtual Reality: A Case Study on a Recreated MGM Grand Hotel Fire

Authors: Humayun Khan, Daniel Nilsson

Abstract: Virtual reality allows creating highly immersive visual and auditory experiences, making users feel physically present in the environment. This makes it an ideal platform to simulate dangerous scenarios, including fire evacuation, and study human behaviour without exposing users to harmful elements. However, human perception of the surroundings is based on the integration of multiple sensory cues (visual, auditory, tactile, or/and olfactory) present in the environment. When some of the sensory stimuli are missing in the virtual experience, it can break the illusion of being there in the environment and could lead to actions that deviate from normal behaviour. In this work, we added an olfactory cue in a well-documented historic hotel fire scenario that was recreated in VR, and examined the effects of the olfactory cue on human behaviour. We conducted a between subject study on 40 naive participants. Our results show that the addition of the olfactory cue could increase behavioural realism. We found that 80% of the studied actions for the VR with olfactory cue condition matched the ones performed by the survivors. In comparison, only 40% of the participants' actions for VR only condition were similar to the survivors.

Submitted 13 November, 2023; originally announced November 2023.

Comments: Accepted at IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2023, 9 pages

arXiv:2311.09086 [pdf, other]

The Uli Dataset: An Exercise in Experience Led Annotation of oGBV

Authors: Arnav Arora, Maha Jinadoss, Cheshta Arora, Denny George, Brindaalakshmi, Haseena Dawood Khan, Kirti Rawat, Div, Ritash, Seema Mathur, Shivani Yadav, Shehla Rashid Shora, Rie Raut, Sumit Pawar, Apurva Paithane, Sonia, Vivek, Dharini Priscilla, Khairunnisha, Grace Banu, Ambika Tandon, Rishav Thakker, Rahul Dev Korra, Aatman Vaidya, Tarunima Prabhakar

Abstract: Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of language specific and contextual data to build such automated tools. In this paper we present a dataset on gendered abuse in three languages: Hindi, Tamil and Indian English. The dataset comprises tweets annotated along three questions pertaining to the experience of gender abuse, by experts who identify as women or a member of the LGBTQIA community in South Asia. Through this dataset we demonstrate a participatory approach to creating datasets that drive AI systems.

Submitted 24 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.06802 [pdf, other]

A hybrid discrete exterior calculus and finite difference method for anelastic convection in spherical shells

Authors: Hamid Hassan Khan, Pankaj Jagad, Matteo Parsani

Abstract: The present work develops, verifies, and benchmarks a hybrid discrete exterior calculus and finite difference (DEC-FD) method for density-stratified thermal convection in spherical shells. Discrete exterior calculus (DEC) is notable for its coordinate independence and structure preservation properties. The hybrid DEC-FD method for Boussinesq convection was developed by Mantravadi et al. (Mantravadi, B., Jagad, P., & Samtaney, R. (2023). A hybrid discrete exterior calculus and finite difference method for Boussinesq convection in spherical shells. Journal of Computational Physics, 491, 112397). Motivated by astrophysics problems, we extend this method to anelastic convection, which retains density stratification and has been widely used for decades to understand thermal convection in stars and giant planets. In the present work, the governing equations are split into surface and radial components, and discrete anelastic equations are derived by replacing spherical surface operators with DEC operators and radial operators with FD operators. The novel feature of this work is the discretization of the anelastic equations with the DEC-FD method and the assessment of a hybrid solver for density-stratified thermal convection in spherical shells. The discretized anelastic equations are verified using the method of manufactured solutions (MMS). We performed a series of three-dimensional convection simulations in a spherical shell geometry and examined the effect of density ratio on convective flow structures and energy dynamics. The present observations are in agreement with the benchmark models.

Submitted 12 November, 2023; originally announced November 2023.

Comments: 32 pages, 13 figures

arXiv:2311.06226 [pdf, other]

MaDEVIoT: Cyberattacks on EV Charging Can Disrupt Power Grid Operation

Authors: Samrat Acharya, Hafiz Anwar Ullah Khan, Ramesh Karri, Yury Dvorkin

Abstract: This paper examines the feasibility of demand-side cyberattacks on power grids launched via internet-connected high-power EV Charging Stations (EVCSs). By distorting power grid frequency and voltage, these attacks can trigger system-wide outages. Our case study focuses on Manhattan, New York, and reveals that such attacks will become feasible by 2030 with increased EV adoption. With a single EVCS company dominating Manhattan, compromising a single EVCS server raises serious power grid security concerns. These attacks can overload power lines and trip over-frequency (OF) protection relays, resulting in a power grid blackout. This study serves as a crucial resource for planning authorities and power grid operators involved in the EV charging infrastructure roll-out, highlighting potential cyberthreats to power grids stemming from high-power EVCSs.

Submitted 10 November, 2023; originally announced November 2023.

Comments: This paper has been accepted for publication in the proceedings of IEEE ISGT NA 2024 in Washington DC, USA

Showing 1–50 of 359 results for author: Khan, H