Search | arXiv e-print repository

LookupViT: Compressing visual information to a limited number of tokens

Authors: Rajat Koner, Gagan Jain, Prateek Jain, Volker Tresp, Sujoy Paul

Abstract: Vision Transformers (ViT) have emerged as the de-facto choice for numerous industry grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, that aims to exploit this information sparsity to reduce ViT inference cost. LookupViT provides a novel general purpose vision transformer block that operates by compressing information from higher resolution tokens to a fixed number of tokens. These few compressed tokens undergo meticulous processing, while the higher-resolution tokens are passed through computationally cheaper layers. Information sharing between these two token sets is enabled through a bidirectional cross-attention mechanism. The approach offers multiple advantages - (a) easy to implement on standard ML accelerators (GPUs/TPUs) via standard high-level operators, (b) applicable to standard ViT and its variants, thus generalizes to various tasks, (c) can handle different tokenization and attention approaches. LookupViT also offers flexibility for the compressed tokens, enabling performance-computation trade-offs in a single trained model. We show LookupViT's effectiveness on multiple domains - (a) for image-classification (ImageNet-1K and ImageNet-21K), (b) video classification (Kinetics400 and Something-Something V2), (c) image captioning (COCO-Captions) with a frozen encoder. LookupViT provides $2\times$ reduction in FLOPs while upholding or improving accuracy across these domains. In addition, LookupViT also demonstrates out-of-the-box robustness and generalization on image classification (ImageNet-C,R,A,O), improving by up to $4\%$ over ViT.

Submitted 17 July, 2024; originally announced July 2024.

Comments: ECCV 2024

arXiv:2407.12113 [pdf, other]

A Graph-based Adversarial Imitation Learning Framework for Reliable & Realtime Fleet Scheduling in Urban Air Mobility

Authors: Prithvi Poddar, Steve Paul, Souma Chowdhury

Abstract: The advent of Urban Air Mobility (UAM) presents the scope for a transformative shift in the domain of urban transportation. However, its widespread adoption and economic viability depend in part on the ability to optimally schedule the fleet of aircraft across vertiports in a UAM network, under uncertainties attributed to airspace congestion, changing weather conditions, and varying demands. This paper presents a comprehensive optimization formulation of the fleet scheduling problem, while also identifying the need for alternate solution approaches, since directly solving the resulting integer nonlinear programming problem is computationally prohibitive for daily fleet scheduling. Previous work has shown the effectiveness of using (graph) reinforcement learning (RL) approaches to train real-time executable policy models for fleet scheduling. However, such policies can often be brittle on out-of-distribution scenarios or edge cases. Moreover, training performance also deteriorates as the complexity (e.g., number of constraints) of the problem increases. To address these issues, this paper presents an imitation learning approach where the RL-based policy exploits expert demonstrations yielded by solving the exact optimization using a Genetic Algorithm. The policy model comprises Graph Neural Network (GNN) based encoders that embed the space of vertiports and aircraft, Transformer networks to encode demand, passenger fare, and transport cost profiles, and a Multi-head attention (MHA) based decoder. Expert demonstrations are used through the Generative Adversarial Imitation Learning (GAIL) algorithm. Interfaced with a UAM simulation environment involving 8 vertiports and 40 aircraft, in terms of the daily profits earned reward, the new imitative approach achieves better mean performance and remarkable improvement in the case of unseen worst-case scenarios, compared to pure RL results.

Submitted 16 July, 2024; originally announced July 2024.

Comments: Accepted for presentation at the AIAA Aviation Forum 2024

arXiv:2407.06096 [pdf, other]

Muzzle-Based Cattle Identification System Using Artificial Intelligence (AI)

Authors: Hasan Zohirul Islam, Safayet Khan, Sanjib Kumar Paul, Sheikh Imtiaz Rahi, Fahim Hossain Sifat, Md. Mahadi Hasan Sany, Md. Shahjahan Ali Sarker, Tareq Anam, Ismail Hossain Polas

Abstract: Absence of tamper-proof cattle identification technology was a significant problem preventing insurance companies from providing livestock insurance. This lack of technology had devastating financial consequences for marginal farmers as they did not have the opportunity to claim compensation for any unexpected events such as the accidental death of cattle in Bangladesh. Using machine learning and deep learning algorithms, we have solved the bottleneck of cattle identification by developing and introducing a muzzle-based cattle identification system. The uniqueness of cattle muzzles has been scientifically established, which resembles human fingerprints. This is the fundamental premise that prompted us to develop a cattle identification system that extracts the uniqueness of cattle muzzles. For this purpose, we collected 32,374 images from 826 cattle. Contrast-limited adaptive histogram equalization (CLAHE) with sharpening filters was applied in the preprocessing steps to remove noise from images. We used the YOLO algorithm for cattle muzzle detection in the image and the FaceNet architecture to learn unified embeddings from muzzle images using squared $L_2$ distances. Our system performs with an accuracy of $96.489\%$, $F_1$ score of $97.334\%$, and a true positive rate (tpr) of $87.993\%$ at a remarkably low false positive rate (fpr) of $0.098\%$. This reliable and efficient system for identifying cattle can significantly advance livestock insurance and precision farming.

Submitted 8 July, 2024; originally announced July 2024.
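
Illustrative note on the entry above: the abstract reports a true positive rate at a fixed false positive rate for embeddings compared by squared $L_2$ distance. The sketch below shows one common way such verification metrics could be computed. It is not the authors' code; the function names, the toy 2-D embeddings, and the threshold value are all assumptions made for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_animal(a, b, threshold):
    """Declare a match when the embeddings are closer than the threshold."""
    return squared_l2(a, b) <= threshold


def tpr_fpr(pairs, threshold):
    """Compute (TPR, FPR) over labelled pairs (emb_a, emb_b, is_same)."""
    tp = fn = fp = tn = 0
    for emb_a, emb_b, is_same in pairs:
        predicted = same_animal(emb_a, emb_b, threshold)
        if is_same and predicted:
            tp += 1
        elif is_same:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy pairs: two genuine pairs and two impostor pairs.
toy_pairs = [
    ([0.0, 0.0], [0.1, 0.0], True),
    ([1.0, 0.0], [1.0, 0.2], True),
    ([0.0, 0.0], [1.0, 1.0], False),
    ([0.0, 0.0], [0.1, 0.1], False),
]

tpr, fpr = tpr_fpr(toy_pairs, threshold=0.05)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
```

Lowering the threshold trades true positives for fewer false positives, which is why a TPR is usually quoted together with the FPR at which it was measured.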

arXiv:2407.05399 [pdf, other]

IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning

Authors: Abhinav Joshi, Shounak Paul, Akshat Sharma, Pawan Goyal, Saptarshi Ghosh, Ashutosh Modi

Abstract: Legal systems worldwide are inundated with exponential growth in cases and documents. There is an imminent need to develop NLP and ML techniques for automatically processing and understanding legal documents to streamline the legal system. However, evaluating and comparing various NLP models designed specifically for the legal domain is challenging. This paper addresses this challenge by proposing IL-TUR: Benchmark for Indian Legal Text Understanding and Reasoning. IL-TUR contains monolingual (English, Hindi) and multi-lingual (9 Indian languages) domain-specific tasks that address different aspects of the legal system from the point of view of understanding and reasoning over Indian legal documents. We present baseline models (including LLM-based) for each task, outlining the gap between models and the ground truth. To foster further research in the legal domain, we create a leaderboard (available at: https://exploration-lab.github.io/IL-TUR/) where the research community can upload and compare legal text understanding systems.

Submitted 7 July, 2024; originally announced July 2024.

Comments: Accepted at ACL 2024 Main Conference; 40 Pages (9 Pages + References + Appendix)

arXiv:2407.04589 [pdf, other]

Remembering Everything Makes You Vulnerable: A Limelight on Machine Unlearning for Personalized Healthcare Sector

Authors: Ahan Chatterjee, Sai Anirudh Aryasomayajula, Rajat Chaudhari, Subhajit Paul, Vishwa Mohan Singh

Abstract: As the prevalence of data-driven technologies in healthcare continues to rise, concerns regarding data privacy and security become increasingly paramount. This thesis aims to address the vulnerability of personalized healthcare models, particularly in the context of ECG monitoring, to adversarial attacks that compromise patient privacy. We propose an approach termed "Machine Unlearning" to mitigate the impact of exposed data points on machine learning models, thereby enhancing model robustness against adversarial attacks while preserving individual privacy. Specifically, we investigate the efficacy of Machine Unlearning in the context of personalized ECG monitoring, utilizing a dataset of clinical ECG recordings. Our methodology involves training a deep neural classifier on ECG data and fine-tuning the model for individual patients. We demonstrate the susceptibility of fine-tuned models to adversarial attacks, such as the Fast Gradient Sign Method (FGSM), which can exploit additional data points in personalized models. To address this vulnerability, we propose a Machine Unlearning algorithm that selectively removes sensitive data points from fine-tuned models, effectively enhancing model resilience against adversarial manipulation. Experimental results demonstrate the effectiveness of our approach in mitigating the impact of adversarial attacks while maintaining the pre-trained model accuracy.

Submitted 5 July, 2024; originally announced July 2024.

Comments: 15 Pages, Exploring unlearning techniques on ECG Classifier

arXiv:2406.16612 [pdf, other]

Towards Physically Talented Aerial Robots with Tactically Smart Swarm Behavior thereof: An Efficient Co-design Approach

Authors: Prajit KrisshnaKumar, Steve Paul, Hemanth Manjunatha, Mary Corra, Ehsan Esfahani, Souma Chowdhury

Abstract: The collective performance or capacity of collaborative autonomous systems such as a swarm of robots is jointly influenced by the morphology and the behavior of individual systems in that collective. In that context, this paper explores how morphology impacts the learned tactical behavior of unmanned aerial/ground robots performing reconnaissance and search & rescue. This is achieved by presenting a computationally efficient framework to solve this otherwise challenging problem of jointly optimizing the morphology and tactical behavior of swarm robots. Key novel developments to this end include the use of physical talent metrics and modification of graph reinforcement learning architectures to allow joint learning of the swarm tactical policy and the talent metrics (search speed, flight range, and cruising speed) that constrain mobility and object/victim search capabilities of the aerial robots executing these tactics. Implementation of this co-design approach is supported by advancements to an open-source Pybullet-based swarm simulator that allows the use of variable aerial asset capabilities. The results of the co-design are observed to outperform those of tactics learning with a fixed Pareto design, when compared in terms of mission performance metrics. Significant differences in morphology and learned behavior are also observed by comparing the baseline design and the co-design outcomes.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted for presentation in proceedings of ASME IDETC-CIE 2024

arXiv:2406.10328 [pdf, other]

From Pixels to Prose: A Large Dataset of Dense Image Captions

Authors: Vasu Singla, Kaiyu Yue, Sukriti Paul, Reza Shirkavand, Mayuka Jayawardhana, Alireza Ganjdanesh, Heng Huang, Abhinav Bhatele, Gowthami Somepalli, Tom Goldstein

Abstract: Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose

Submitted 14 June, 2024; originally announced June 2024.

Comments: pixelprose 16M dataset

arXiv:2406.06424 [pdf, other]

Margin-aware Preference Optimization for Aligning Diffusion Models without Reference

Authors: Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, Jongheon Jeong

Abstract: Modern alignment techniques based on human preferences, such as RLHF and DPO, typically employ divergence regularization relative to the reference model to ensure training stability. However, this often limits the flexibility of models during alignment, especially when there is a clear distributional discrepancy between the preference data and the reference model. In this paper, we focus on the alignment of recent text-to-image diffusion models, such as Stable Diffusion XL (SDXL), and find that this "reference mismatch" is indeed a significant problem in aligning these models due to the unstructured nature of visual modalities: e.g., a preference for a particular stylistic aspect can easily induce such a discrepancy. Motivated by this observation, we propose a novel and memory-friendly preference alignment method for diffusion models that does not depend on any reference model, coined margin-aware preference optimization (MaPO). MaPO jointly maximizes the likelihood margin between the preferred and dispreferred image sets and the likelihood of the preferred sets, simultaneously learning general stylistic features and preferences. For evaluation, we introduce two new pairwise preference datasets, which comprise self-generated image pairs from SDXL, Pick-Style and Pick-Safety, simulating diverse scenarios of reference mismatch. Our experiments validate that MaPO can significantly improve alignment on Pick-Style and Pick-Safety and general preference alignment when used with Pick-a-Pic v2, surpassing the base SDXL and other existing methods. Our code, models, and datasets are publicly available via https://mapo-t2i.github.io

Submitted 10 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.00375 [pdf, other]

Teledrive: An Embodied AI based Telepresence System

Authors: Snehasis Banerjee, Sayan Paul, Ruddradev Roychoudhury, Abhijan Bhattacharya, Chayan Sarkar, Ashis Sau, Pradip Pramanick, Brojeshwar Bhowmick

Abstract: This article presents Teledrive, a telepresence robotic system with embodied AI features that empowers an operator to navigate the telerobot in any unknown remote place with minimal human intervention. We conceive Teledrive in the context of democratizing remote care-giving for elderly citizens as well as for isolated patients, affected by contagious diseases. In particular, this paper focuses on the problem of navigating to a rough target area (like bedroom or kitchen) rather than pre-specified point destinations. This ushers in a unique AreaGoal based navigation feature, which has not been explored in depth in the contemporary solutions. Further, we describe an edge computing-based software system built on a WebRTC-based communication framework to realize the aforementioned scheme through an easy-to-use speech-based human-robot interaction. Moreover, to enhance the ease of operation for the remote caregiver, we incorporate a person following feature, whereby a robot follows a person on the move in its premises as directed by the operator. Moreover, the system presented is loosely coupled with specific robot hardware, unlike the existing solutions. We have evaluated the efficacy of the proposed system through baseline experiments, user study, and real-life deployment.

Submitted 1 June, 2024; originally announced June 2024.

Comments: Accepted in Journal of Intelligent Robotic System

Journal ref: Journal of Intelligent Robotic System 2024

arXiv:2405.16517 [pdf, other]

Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

Authors: Soumava Paul, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen

Abstract: We aim to tackle sparse-view reconstruction of a 360 3D scene using priors from latent diffusion models (LDM). The sparse-view setting is ill-posed and underconstrained, especially for scenes where the camera rotates 360 degrees around a point, as no visual information is available beyond some frontal views focused on the central object(s) of interest. In this work, we show that pretrained 2D diffusion models can strongly improve the reconstruction of a scene with low-cost fine-tuning. Specifically, we present SparseSplat360 (Sp2360), a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views. Due to superior training and rendering speeds, we use an explicit scene representation in the form of 3D Gaussians over NeRF-based implicit representations. We propose an iterative update strategy to fuse generated pseudo novel views with existing 3D Gaussians fitted to the initial sparse inputs. As a result, we obtain a multi-view consistent scene representation with details coherent with the observed inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows that our proposed 2D to 3D distillation algorithm considerably improves the performance of a regularized version of 3DGS adapted to a sparse-view setting and outperforms existing sparse-view reconstruction methods in 360 scene reconstruction. Qualitatively, our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail.

Submitted 2 June, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

Comments: 18 pages, 11 figures, 4 tables

arXiv:2405.01421 [pdf, ps, other]

Systematic Construction of Golay Complementary Sets of Arbitrary Lengths and Alphabet Sizes

Authors: Abhishek Roy, Sudhan Majhi, Subhabrata Paul

Abstract: One of the important applications of Golay complementary sets (GCSs) is the reduction of peak-to-mean envelope power ratio (PMEPR) in orthogonal frequency division multiplexing (OFDM) systems. OFDM has played a major role in modern wireless systems such as long-term-evolution (LTE), 5th generation (5G) wireless standards, etc. This paper searches for systematic constructions of GCSs of arbitrary lengths and alphabet sizes. The proposed constructions are based on extended Boolean functions (EBFs). For the first time, we can generate codes of independent parameter choices.

Submitted 8 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

MSC Class: 94A55; 94A15; 94D10

arXiv:2404.02447 [pdf]

A Novel Approach to Breast Cancer Histopathological Image Classification Using Cross-Colour Space Feature Fusion and Quantum-Classical Stack Ensemble Method

Authors: Sambit Mallick, Snigdha Paul, Anindya Sen

Abstract: Breast cancer classification stands as a pivotal pillar in ensuring timely diagnosis and effective treatment. This study with histopathological images underscores the profound significance of harnessing the synergistic capabilities of colour space ensembling and quantum-classical stacking to elevate the precision of breast cancer classification. By delving into the distinct colour spaces of RGB, HSV and CIE L*u*v, the authors initiated a comprehensive investigation guided by advanced methodologies. Employing the DenseNet121 architecture for feature extraction, the authors have capitalized on the robustness of Random Forest, SVM, QSVC, and VQC classifiers. This research encompasses a unique feature fusion technique within the colour space ensemble. This approach not only deepens our comprehension of breast cancer classification but also marks a milestone in personalized medical assessment. The amalgamation of quantum and classical classifiers through stacking emerges as a potent catalyst, effectively mitigating the inherent constraints of individual classifiers, paving a robust path towards more dependable and refined breast cancer identification. Through rigorous experimentation and meticulous analysis, fusion of colour spaces like RGB with HSV and RGB with CIE L*u*v yields a classification accuracy approaching unity. This underscores the transformative potential of our approach, where the fusion of diverse colour spaces and the synergy of quantum and classical realms converge to establish a new horizon in medical diagnostics. Thus the implications of this research extend across medical disciplines, offering promising avenues for advancing diagnostic accuracy and treatment efficacy.

Submitted 3 April, 2024; originally announced April 2024.

arXiv:2404.01197 [pdf, other]

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang

Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language datasets do not represent spatial relationships well enough; to alleviate this bottleneck, we create SPRIGHT, the first spatially-focused, large scale dataset, by re-captioning 6 million images from 4 widely used vision datasets. Through a 3-fold evaluation and analysis pipeline, we find that SPRIGHT largely improves upon existing datasets in capturing spatial relationships. To demonstrate its efficacy, we leverage only ~0.25% of SPRIGHT and achieve a 22% improvement in generating spatially accurate images while also improving the FID and CMMD scores. Secondly, we find that training on images containing a large number of objects results in substantial improvements in spatial consistency. Notably, we attain state-of-the-art on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on <500 images. Finally, through a set of controlled experiments and ablations, we document multiple findings that we believe will enhance the understanding of factors that affect spatial consistency in text-to-image models. We publicly release our dataset and model to foster further research in this area.

Submitted 1 April, 2024; originally announced April 2024.

Comments: project webpage : https://spright-t2i.github.io/

arXiv:2403.14687 [pdf, other]

On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets

Authors: Luke Oluwaseye Joel, Wesley Doorsamy, Babu Sena Paul

Abstract: Missing values are a common characteristic of real-world datasets, especially healthcare data. This can be frustrating when using machine learning algorithms on such datasets, simply because most machine learning models perform poorly in the presence of missing values. The aim of this study is to compare the performance of seven imputation techniques, namely Mean imputation, Median Imputation, Last Observation carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, Missforest imputation, and Multiple imputation by Chained Equations (MICE), on three healthcare datasets. Missing values at rates of 10\%, 15\%, 20\% and 25\% were introduced into the dataset, and the imputation techniques were employed to impute these missing values. Their performance was compared using root mean squared error (RMSE) and mean absolute error (MAE). The results show that Missforest imputation performs the best followed by MICE imputation. Additionally, we try to determine whether it is better to perform feature selection before imputation or vice versa by using the following metrics - the recall, precision, f1-score and accuracy. Because there is little literature on this and some debate on the subject among researchers, we hope that the results from this experiment will encourage data scientists and researchers to perform imputation first before feature selection when dealing with data containing missing values.

Submitted 13 March, 2024; originally announced March 2024.
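
Illustrative note on the entry above: the evaluation protocol in the abstract (remove a known fraction of values, impute, then score with RMSE and MAE against the held-back truth) can be sketched in a few lines. This is a minimal sketch, not the study's code. It uses mean imputation only, a hypothetical toy column instead of a healthcare dataset, and helper names chosen for illustration.

```python
import math
import random


def mask_values(values, fraction, seed=0):
    """Return a copy with `fraction` of entries set to None, plus their indices."""
    rng = random.Random(seed)
    n_missing = max(1, round(len(values) * fraction))
    idx = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in idx:
        masked[i] = None
    return masked, idx


def mean_impute(values):
    """Fill every None with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def rmse(truth, imputed, idx):
    """Root mean squared error over the masked positions only."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in idx) / len(idx))


def mae(truth, imputed, idx):
    """Mean absolute error over the masked positions only."""
    return sum(abs(truth[i] - imputed[i]) for i in idx) / len(idx)


# Hypothetical toy column standing in for one feature of a dataset.
truth = [float(v) for v in range(1, 21)]

for fraction in (0.10, 0.15, 0.20, 0.25):
    masked, idx = mask_values(truth, fraction)
    imputed = mean_impute(masked)
    print(f"{fraction:.0%} missing: "
          f"RMSE={rmse(truth, imputed, idx):.3f} MAE={mae(truth, imputed, idx):.3f}")
```

Errors are scored only at the masked positions, since the observed entries are copied through unchanged and would otherwise dilute the comparison between techniques.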

arXiv:2403.07131 [pdf, other]

Bigraph Matching Weighted with Learnt Incentive Function for Multi-Robot Task Allocation

Authors: Steve Paul, Nathan Maurer, Souma Chowdhury

Abstract: Most real-world Multi-Robot Task Allocation (MRTA) problems require fast and efficient decision-making, which is often achieved using heuristics-aided methods such as genetic algorithms, auction-based methods, and bipartite graph matching methods. These methods often assume a form that lends better explainability compared to an end-to-end (learnt) neural network based policy for MRTA. However, deriving suitable heuristics can be tedious, risky and in some cases impractical if problems are too complex. This raises the question: can these heuristics be learned? To this end, this paper particularly develops a Graph Reinforcement Learning (GRL) framework to learn the heuristics or incentives for a bipartite graph matching approach to MRTA. Specifically, a Capsule Attention policy model is used to learn how to weight task/robot pairings (edges) in the bipartite graph that connects the set of tasks to the set of robots. The original capsule attention network architecture is fundamentally modified by adding encoding of robots' state graph, and two Multihead Attention based decoders whose outputs are used to construct a LogNormal distribution matrix from which positive bigraph weights can be drawn. The performance of this new bigraph matching approach augmented with a GRL-derived incentive is found to be at par with the original bigraph matching approach that used expert-specified heuristics, with the former offering notable robustness benefits. During training, the learned incentive policy is found to get initially closer to the expert-specified incentive and then slightly deviate from its trend.

Submitted 11 March, 2024; originally announced March 2024.

Comments: This paper was accepted for presentation in proceedings of IEEE International Conference on Robotics and Automation 2024

arXiv:2403.04962 [pdf, other]

C2P-GCN: Cell-to-Patch Graph Convolutional Network for Colorectal Cancer Grading

Authors: Sudipta Paul, Bulent Yener, Amanda W. Lund

Abstract: Graph-based learning approaches, due to their ability to encode tissue/organ structure information, are increasingly favored for grading colorectal cancer histology images. Recent graph-based techniques involve dividing whole slide images (WSIs) into smaller or medium-sized patches, and then building graphs on each patch for direct use in training. This method, however, fails to capture the tissue structure information present in an entire WSI and relies on training from a significantly large dataset of image patches. In this paper, we propose a novel cell-to-patch graph convolutional network (C2P-GCN), which is a two-stage graph formation-based approach. In the first stage, it forms a patch-level graph based on the cell organization on each patch of a WSI. In the second stage, it forms an image-level graph based on a similarity measure between patches of a WSI considering each patch as a node of a graph. This graph representation is then fed into a multi-layer GCN-based classification network. Our approach, through its dual-phase graph construction, effectively gathers local structural details from individual patches and establishes a meaningful connection among all patches across a WSI. As C2P-GCN integrates the structural data of an entire WSI into a single graph, it allows our model to work with significantly fewer training data compared to the latest models for colorectal cancer. Experimental validation of C2P-GCN on two distinct colorectal cancer datasets demonstrates the effectiveness of our method.

Submitted 13 May, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Accepted at IEEE EMBC 2024

arXiv:2403.04537 [pdf]

VLSI Architectures of Forward Kinematic Processor for Robotics Applications

Authors: Sourav Roy, Subhadeep Paul, Tapas Kumar Maiti

Abstract: This paper aims to provide a comprehensive review of current-day robotic computation technologies at the VLSI architecture level. We studied several reports in the domain of robotic processor architecture. In this work, we focused on the forward kinematics architectures which consider CORDIC algorithms, VLSI circuits of WE DSP16 chip, parallel processing and pipelined architecture, and lookup table formula and FPGA processor. This study gives us an understanding of different implementation methods for forward kinematics. Our goal is to develop an FPGA-based forward kinematics processor for real-time applications that require fast response times and low latency, which is useful for industrial automation, where processing speed plays a major role.

Submitted 7 March, 2024; originally announced March 2024.

Comments: 8 pages, 22 figures

arXiv:2402.17412 [pdf, other]

DiffuseKronA: A Parameter Efficient Fine-tuning Method for Personalized Diffusion Models

Authors: Shyam Marjit, Harshit Singh, Nityanand Mathur, Sayak Paul, Chia-Mu Yu, Pin-Yu Chen

Abstract: In the realm of subject-driven text-to-image (T2I) generative models, recent developments like DreamBooth and BLIP-Diffusion have led to impressive results yet encounter limitations due to their intensive fine-tuning demands and substantial parameter requirements. While the low-rank adaptation (LoRA) module within DreamBooth offers a reduction in trainable parameters, it introduces a pronounced sensitivity to hyperparameters, leading to a compromise between parameter efficiency and the quality of T2I personalized image synthesis. Addressing these constraints, we introduce DiffuseKronA, a novel Kronecker product-based adaptation module that not only significantly reduces the parameter count by 35% and 99.947% compared to LoRA-DreamBooth and the original DreamBooth, respectively, but also enhances the quality of image synthesis. Crucially, DiffuseKronA mitigates the issue of hyperparameter sensitivity, delivering consistent high-quality generations across a wide range of hyperparameters, thereby diminishing the necessity for extensive fine-tuning. Furthermore, a more controllable decomposition makes DiffuseKronA more interpretable, and it can even achieve up to a 50% reduction with results comparable to LoRA-DreamBooth. Evaluated against diverse and complex input images and text prompts, DiffuseKronA consistently outperforms existing models, producing diverse images of higher quality with improved fidelity and a more accurate color distribution of objects, all the while upholding exceptional parameter efficiency, thus presenting a substantial advancement in the field of T2I generative modeling. Our project page, consisting of links to the code and pre-trained checkpoints, is available at https://diffusekrona.github.io/.

Submitted 28 February, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: Project Page: https://diffusekrona.github.io/

arXiv:2402.09757 [pdf, ps, other]

Construction of CCC and ZCCS Through Additive Characters Over Galois Field

Authors: Gobinda Ghosh, Sudhan Majhi, Subhabrata Paul

Abstract: With the rapid progression in wireless communication technologies, especially in multicarrier code-division multiple access (MC-CDMA), there is a need for advanced code construction methods. Traditional approaches, mainly based on generalized Boolean functions, have limitations in code length versatility. This paper introduces a novel approach to constructing complete complementary codes (CCC) and Z-complementary code sets (ZCCS) for reducing interference in MC-CDMA systems. The proposed construction, distinct from Boolean function-based approaches, employs additive characters over Galois fields GF($p^{r}$), where $p$ is prime and $r$ is a positive integer. First, we develop CCCs with lengths of $p^{r}$, which are then extended to construct ZCCS with both unreported lengths and sizes of $np^{r}$, where $n$ is an arbitrary positive integer. The versatility of this method is further highlighted as it includes the lengths of ZCCS reported in prior studies as special cases, underscoring the method's comprehensive nature and superiority.

Submitted 18 March, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2402.04814 [pdf, other]

BOWLL: A Deceptively Simple Open World Lifelong Learner

Authors: Roshni Kamath, Rupert Mitchell, Subarnaduti Paul, Kristian Kersting, Martin Mundt

Abstract: The quest to improve scalar performance numbers on predetermined benchmarks seems to be deeply engraved in deep learning. However, the real world is seldom carefully curated and applications are seldom limited to excelling on test sets. A practical system is generally required to recognize novel concepts, refrain from actively including uninformative data, and retain previously acquired knowledge throughout its lifetime. Despite these key elements being rigorously researched individually, the study of their conjunction, open world lifelong learning, is only a recent trend. To accelerate this multifaceted field's exploration, we introduce its first monolithic and much-needed baseline. Leveraging the ubiquitous use of batch normalization across deep neural networks, we propose a deceptively simple yet highly effective way to repurpose standard models for open world lifelong learning. Through extensive empirical evaluation, we highlight why our approach should serve as a future standard for models that are able to effectively maintain their knowledge, selectively focus on informative data, and accelerate future learning.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.03388 [pdf, other]

doi 10.1145/3583780.3614839

Delivery Optimized Discovery in Behavioral User Segmentation under Budget Constraint

Authors: Harshita Chopra, Atanu R. Sinha, Sunav Choudhary, Ryan A. Rossi, Paavan Kumar Indela, Veda Pranav Parwatala, Srinjayee Paul, Aurghya Maiti

Abstract: Users' behavioral footprints online enable firms to discover behavior-based user segments (or, segments) and deliver segment specific messages to users. Following the discovery of segments, delivery of messages to users through preferred media channels like Facebook and Google can be challenging, as only a portion of users in a behavior segment find match in a medium, and only a fraction of those matched actually see the message (exposure). Even high quality discovery becomes futile when delivery fails. Many sophisticated algorithms exist for discovering behavioral segments; however, these ignore the delivery component. The problem is compounded because (i) the discovery is performed on the behavior data space in firms' data (e.g., user clicks), while the delivery is predicated on the static data space (e.g., geo, age) as defined by media; and (ii) firms work under budget constraint. We introduce a stochastic optimization based algorithm for delivery optimized discovery of behavioral user segmentation and offer new metrics to address the joint optimization. We leverage optimization under a budget constraint for delivery combined with a learning-based component for discovery. Extensive experiments on a public dataset from Google and a proprietary dataset show the effectiveness of our approach by simultaneously improving delivery metrics, reducing budget spend and achieving strong predictive performance in discovery.

Submitted 15 March, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2402.00637 [pdf, other]

Fisheye Camera and Ultrasonic Sensor Fusion For Near-Field Obstacle Perception in Bird's-Eye-View

Authors: Arindam Das, Sudarshan Paul, Niko Scholz, Akhilesh Kumar Malviya, Ganesh Sistu, Ujjwal Bhattacharya, Ciarán Eising

Abstract: Accurate obstacle identification represents a fundamental challenge within the scope of near-field perception for autonomous driving. Conventionally, fisheye cameras are frequently employed for comprehensive surround-view perception, including rear-view obstacle localization. However, the performance of such cameras can significantly deteriorate in low-light conditions, during nighttime, or when subjected to intense sun glare. Conversely, cost-effective sensors like ultrasonic sensors remain largely unaffected under these conditions. Therefore, we present, to our knowledge, the first end-to-end multimodal fusion model tailored for efficient obstacle perception in a bird's-eye-view (BEV) perspective, utilizing fisheye cameras and ultrasonic sensors. Initially, ResNeXt-50 is employed as a set of unimodal encoders to extract features specific to each modality. Subsequently, the feature space associated with the visible spectrum undergoes transformation into BEV. The fusion of these two modalities is facilitated via concatenation. At the same time, the ultrasonic spectrum-based unimodal feature maps pass through content-aware dilated convolution, applied to mitigate the sensor misalignment between two sensors in the fused feature space. Finally, the fused features are utilized by a two-stage semantic occupancy decoder to generate grid-wise predictions for precise obstacle perception. We conduct a systematic investigation to determine the optimal strategy for multimodal fusion of both sensors. We provide insights into our dataset creation procedures, annotation guidelines, and perform a thorough data analysis to ensure adequate coverage of all scenarios. When applied to our dataset, the experimental results underscore the robustness and effectiveness of our proposed multimodal fusion approach.

Submitted 1 February, 2024; originally announced February 2024.

Comments: 16 pages, 12 Figures, 6 tables

arXiv:2401.05252 [pdf, other]

PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models

Authors: Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li

Abstract: This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images, marking a 7x improvement over the PIXART-α. Additionally, PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability (von Platen et al., 2023), PIXART-δ can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.

Submitted 10 January, 2024; originally announced January 2024.

Comments: Technical Report

arXiv:2401.04851 [pdf, other]

Graph Learning-based Fleet Scheduling for Urban Air Mobility under Operational Constraints, Varying Demand & Uncertainties

Authors: Steve Paul, Jhoel Witter, Souma Chowdhury

Abstract: This paper develops a graph reinforcement learning approach to online planning of the schedule and destinations of electric aircraft that comprise an urban air mobility (UAM) fleet operating across multiple vertiports. This fleet scheduling problem is formulated to consider time-varying demand, constraints related to vertiport capacity, aircraft capacity and airspace safety guidelines, uncertainties related to take-off delay, weather-induced route closures, and unanticipated aircraft downtime. Collectively, such a formulation presents greater complexity, and potentially increased realism, than in existing UAM fleet planning implementations. To address these complexities, a new policy architecture is constructed, primary components of which include: graph capsule conv-nets for encoding vertiport and aircraft-fleet states both abstracted as graphs; transformer layers encoding time series information on demand and passenger fare; and a Multi-head Attention-based decoder that uses the encoded information to compute the probability of selecting each available destination for an aircraft. Trained with Proximal Policy Optimization, this policy architecture shows significantly better performance in terms of daily averaged profits on unseen test scenarios involving 8 vertiports and 40 aircraft, when compared to a random baseline and genetic algorithm-derived optimal solutions, while being nearly 1000 times faster in execution than the latter.

Submitted 9 January, 2024; originally announced January 2024.

Comments: This paper is accepted to be presented at the ACM Symposium on Applied Computing 2024

arXiv:2401.02677 [pdf, other]

Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss

Authors: Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick Von Platen

Abstract: Stable Diffusion XL (SDXL) has become the best open source text-to-image model (T2I) for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL models is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, achieved through progressive removal using layer-level losses focusing on reducing the model size while preserving generative quality. We release these model weights at https://hf.co/Segmind. Our methodology involves the elimination of residual networks and transformer blocks from the U-Net structure of SDXL, resulting in significant reductions in parameters and latency. Our compact models effectively emulate the original SDXL by capitalizing on transferred knowledge, achieving competitive results against larger multi-billion parameter SDXL. Our work underscores the efficacy of knowledge distillation coupled with layer-level losses in reducing model size while preserving the high-quality generative capabilities of SDXL, thus facilitating more accessible deployment in resource-constrained environments.

Submitted 5 January, 2024; originally announced January 2024.

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.07368 [pdf]

Sequential Planning in Large Partially Observable Environments guided by LLMs

Authors: Swarna Kamal Paul

Abstract: Sequential planning in large state and action spaces quickly becomes intractable due to combinatorial explosion of the search space. Heuristic methods, like Monte Carlo tree search, though effective for large state spaces, struggle if the action space is large. Pure reinforcement learning methods, relying only on reward signals, need a prohibitively large number of interactions with the environment to devise a viable plan. If the state space, observations and actions can be represented in natural language, then Large Language Models (LLMs) can be used to generate action plans. Recently, several such goal-directed agents like Reflexion, CLIN, and SayCan were able to surpass the performance of other state-of-the-art methods with minimal or no task-specific training. But they still struggle with exploration and get stuck in local optima. Their planning capabilities are limited by the limited reasoning capability of the foundational LLMs on text data. We propose a hybrid agent, "neoplanner", that combines state space search with queries to a foundational LLM to get the best action plan. The reward signals are quantitatively used to drive the search. A balance of exploration and exploitation is maintained by maximizing upper confidence bounds of values of states. In places where random exploration is needed, the LLM is queried to generate an action plan. Learnings from each trial are stored as entity relationships in text format. Those are used in future queries to the LLM for continual improvement. Experiments in the Scienceworld environment reveal a 124% improvement over the current best method in terms of average reward gained across multiple tasks.

Submitted 12 December, 2023; originally announced December 2023.

Comments: 8 pages, 2 figures, 1 table

arXiv:2312.02420 [pdf, other]

Towards Granularity-adjusted Pixel-level Semantic Annotation

Authors: Rohit Kundu, Sudipta Paul, Rohit Lal, Amit K. Roy-Chowdhury

Abstract: Recent advancements in computer vision predominantly rely on learning-based systems, leveraging annotations as the driving force to develop specialized models. However, annotating pixel-level information, particularly in semantic segmentation, presents a challenging and labor-intensive task, prompting the need for autonomous processes. In this work, we propose GranSAM which distinguishes itself by providing semantic segmentation at the user-defined granularity level on unlabeled data without the need for any manual supervision, offering a unique contribution in the realm of semantic mask annotation method. Specifically, we propose an approach to enable the Segment Anything Model (SAM) with semantic recognition capability to generate pixel-level annotations for images without any manual supervision. For this, we accumulate semantic information from synthetic images generated by the Stable Diffusion model or web crawled images and employ this data to learn a mapping function between SAM mask embeddings and object class labels. As a result, SAM, enabled with granularity-adjusted mask recognition, can be used for pixel-level semantic annotation purposes. We conducted experiments on the PASCAL VOC 2012 and COCO-80 datasets and observed a +17.95% and +5.17% increase in mIoU, respectively, compared to existing state-of-the-art methods when evaluated under our problem setting.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.17475 [pdf, other]

CLiSA: A Hierarchical Hybrid Transformer Model using Orthogonal Cross Attention for Satellite Image Cloud Segmentation

Authors: Subhajit Paul, Ashutosh Gupta

Abstract: Clouds in optical satellite images are a major concern since their presence hinders the ability to carry out accurate analysis as well as processing. Presence of clouds also affects the image tasking schedule and results in wastage of valuable storage space on ground as well as space-based systems. Due to these reasons, deriving accurate cloud masks from optical remote-sensing images is an important task. Traditional methods for cloud detection in satellite images, such as threshold-based methods and spatial filtering, suffer from a lack of accuracy. In recent years, deep learning algorithms have emerged as a promising approach to solve image segmentation problems as they allow pixel-level classification and semantic-level segmentation. In this paper, we introduce a deep-learning model based on a hybrid transformer architecture for effective cloud mask generation named CLiSA - Cloud segmentation via Lipschitz Stable Attention network. In this context, we propose a concept of orthogonal self-attention combined with a hierarchical cross attention model, and we validate its Lipschitz stability theoretically and empirically. We design the whole setup under an adversarial setting in the presence of Lovász-Softmax loss. We demonstrate both qualitative and quantitative outcomes for multiple satellite image datasets including Landsat-8, Sentinel-2, and Cartosat-2s. Through a comparative study, we show that our model performs favorably against other state-of-the-art methods and also provides better generalization in precise cloud extraction from satellite multi-spectral (MX) images. We also showcase different ablation studies to endorse our choices corresponding to different architectural elements and objective functions.

Submitted 1 December, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: 14 pages, 11 figures, 7 tables

arXiv:2311.16490 [pdf, other]

SIRAN: Sinkhorn Distance Regularized Adversarial Network for DEM Super-resolution using Discriminative Spatial Self-attention

Authors: Subhajit Paul, Ashutosh Gupta

Abstract: Digital Elevation Model (DEM) is an essential aspect in the remote sensing domain to analyze and explore different applications related to surface elevation information. In this study, we intend to address the generation of high-resolution DEMs using high-resolution multi-spectral (MX) satellite imagery by incorporating adversarial learning. To promptly regulate this process, we utilize the notion of polarized self-attention of discriminator spatial maps as well as introduce a Densely connected Multi-Residual Block (DMRB) module to assist in efficient gradient flow. Further, we present an objective function related to optimizing Sinkhorn distance with traditional GAN to improve the stability of adversarial learning. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. We demonstrate both qualitative and quantitative outcomes with available state-of-the-art methods. Based on our experiments on DEM datasets of Shuttle Radar Topographic Mission (SRTM) and Cartosat-1, we show that the proposed model performs favorably against other learning-based state-of-the-art methods. We also generate and visualize several high-resolution DEMs covering terrains with diverse signatures to show the performance of our model.

Submitted 27 November, 2023; originally announced November 2023.

Comments: 15 pages, 14 figures

arXiv:2311.03374 [pdf, other]

Generative AI for Software Metadata: Overview of the Information Retrieval in Software Engineering Track at FIRE 2023

Authors: Srijoni Majumdar, Soumen Paul, Debjyoti Paul, Ayan Bandyopadhyay, Samiran Chattopadhyay, Partha Pratim Das, Paul D Clough, Prasenjit Majumder

Abstract: The Information Retrieval in Software Engineering (IRSE) track aims to develop solutions for automated evaluation of code comments in a machine learning framework based on human and large language model generated labels. In this track, there is a binary classification task to classify comments as useful and not useful. The dataset consists of 9048 code comments and surrounding code snippet pairs extracted from open-source GitHub C-based projects and an additional dataset generated individually by teams using large language models. Overall 56 experiments have been submitted by 17 teams from various universities and software companies. The submissions have been evaluated quantitatively using the F1-Score and qualitatively based on the type of features developed, the supervised learning model used and their corresponding hyper-parameters. The labels generated from large language models increase the bias in the prediction model but lead to less over-fitted results.

Submitted 27 October, 2023; originally announced November 2023.

Comments: Overview Paper of the Information Retrieval of Software Engineering Track at the Forum for Information Retrieval, 2023

arXiv:2311.00724 [pdf]

Fraud Analytics Using Machine-learning & Engineering on Big Data (FAME) for Telecom

Authors: Sudarson Roy Pratihar, Subhadip Paul, Pranab Kumar Dash, Amartya Kumar Das

Abstract: Telecom industries lose 46.3 billion USD globally due to fraud. Data mining and machine learning techniques (apart from rules-oriented approaches) have been used in the past, but efficiency has been low as fraud patterns change very rapidly. This paper presents an industrialized solution approach with a self-adaptive data mining technique and application of big data technologies to detect fraud and discover novel fraud patterns in an accurate, efficient and cost-effective manner. The solution has been successfully demonstrated to detect International Revenue Share Fraud with <5% false positives. More than 1 terabyte of Call Detail Records from a reputed wholesale carrier and overseas telecom transit carrier has been used to conduct this study.

Submitted 31 October, 2023; originally announced November 2023.

Comments: Presented in International Conference in Indian Institute of Management, Bangalore, India

arXiv:2310.07465 [pdf, other]

Algorithmic study on liar's vertex-edge domination problem

Authors: Debojyoti Bhattacharya, Subhabrata Paul

Abstract: Let $G=(V,E)$ be a graph. For an edge $e=xy\in E$, the closed neighbourhood of $e$, denoted by $N_G[e]$ or $N_G[xy]$, is the set $N_G[x]\cup N_G[y]$. A vertex set $L\subseteq V$ is a liar's vertex-edge dominating set of a graph $G=(V,E)$ if for every $e_i\in E$, $|N_G[e_i]\cap L|\geq 2$ and for every pair of distinct edges $e_i$ and $e_j$, $|(N_G[e_i]\cup N_G[e_j])\cap L|\geq 3$. This paper introduces the notion of liar's vertex-edge domination which arises naturally from some applications in communication networks. Given a graph $G$, the \textsc{Minimum Liar's Vertex-Edge Domination Problem} (\textsc{MinLVEDP}) asks to find a liar's vertex-edge dominating set of $G$ of minimum cardinality. In this paper, we study this problem from an algorithmic point of view. We show that \textsc{MinLVEDP} can be solved in linear time for trees, whereas the decision version of this problem is NP-complete for chordal graphs, bipartite graphs, and $p$-claw free graphs for $p\geq 4$. We further study approximation algorithms for this problem. We propose two approximation algorithms for \textsc{MinLVEDP} in general graphs and $p$-claw free graphs. %We propose an $O(\ln Δ(G))$-approximation algorithm for \textsc{MinLVEDP} in general graphs, where $Δ(G)$ is the maximum degree of the input graph. Also, we design a constant factor approximation algorithm for $p$-claw free graphs. On the negative side, we show that the \textsc{MinLVEDP} cannot be approximated within $\frac{1}{2}(\frac{1}{8}-ε)\ln|V|$ for any $ε>0$, unless $NP\subseteq DTIME(|V|^{O(\log(\log|V|))})$. Finally, we prove that the \textsc{MinLVEDP} is APX-complete for bounded degree graphs and $p$-claw free graphs for $p\geq 6$.

Submitted 24 January, 2024; v1 submitted 11 October, 2023; originally announced October 2023.
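
The two conditions in the abstract above can be checked directly on a small graph. The following is a minimal sketch, not the authors' algorithm: the function names and the adjacency-dictionary representation are assumptions made for illustration, and the check is a plain quadratic scan over edge pairs.

```python
from itertools import combinations


def closed_edge_neighbourhood(adj, edge):
    """Return N[x] ∪ N[y] for the edge xy (hypothetical helper)."""
    x, y = edge
    return {x, y} | set(adj[x]) | set(adj[y])


def is_liars_ve_dominating(adj, candidate):
    """Check both conditions of a liar's vertex-edge dominating set.

    adj: dict mapping each vertex to an iterable of its neighbours
         (undirected graph, every edge listed in both directions).
    candidate: iterable of vertices forming the set L.
    """
    L = set(candidate)
    # Collect each undirected edge once.
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    nbhd = {e: closed_edge_neighbourhood(adj, e) for e in edges}

    # Condition 1: every edge sees at least 2 vertices of L.
    if any(len(nbhd[e] & L) < 2 for e in edges):
        return False
    # Condition 2: every pair of distinct edges jointly sees at least 3.
    return all(len((nbhd[e] | nbhd[f]) & L) >= 3
               for e, f in combinations(edges, 2))


# Path on four vertices: 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(is_liars_ve_dominating(path, {0, 1, 2}))  # True
print(is_liars_ve_dominating(path, {1, 2}))     # False: pairs of edges see only 2
```

On the four-vertex path, the set {1, 2} satisfies the per-edge condition but fails the pairwise one, which is what distinguishes the liar's variant from a plain 2-fold requirement.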

arXiv:2310.07452 [pdf, other]

On $k$-vertex-edge domination of graph

Authors: Debojyoti Bhattacharya, Subhabrata Paul

Abstract: Let $G=(V,E)$ be a simple undirected graph. The open neighbourhood of a vertex $v$ in $G$ is defined as $N_G(v)=\{u\in V~|~ uv\in E\}$; whereas the closed neighbourhood is defined as $N_G[v]= N_G(v)\cup \{v\}$. For an integer $k$, a subset $D\subseteq V$ is called a $k$-vertex-edge dominating set of $G$ if for every edge $uv\in E$, $|(N_G[u]\cup N_G[v]) \cap D|\geq k$. In the $k$-vertex-edge domination problem, our goal is to find a $k$-vertex-edge dominating set of minimum cardinality of an input graph $G$. In this paper, we first prove that the decision version of the $k$-vertex-edge domination problem is NP-complete for chordal graphs. On the positive side, we design a linear time algorithm for finding a minimum $k$-vertex-edge dominating set of a tree. We also prove that there is an $O(\log(Δ(G)))$-approximation algorithm for this problem in a general graph $G$, where $Δ(G)$ is the maximum degree of $G$. Then we show that for a graph $G$ with $n$ vertices, this problem cannot be approximated within a factor of $(1-ε) \ln n$ for any $ε>0$ unless $NP\subseteq DTIME(|V|^{O(\log\log|V|)})$. Finally, we prove that it is APX-complete for graphs with bounded degree $k+3$.

Submitted 11 October, 2023; originally announced October 2023.
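
To make the definition in the abstract above concrete, here is a small sketch that tests the $k$-vertex-edge domination condition and finds a minimum set by exhaustive search. It is only an illustration under assumed names and an assumed adjacency-dictionary input; the brute force is exponential and is not the linear-time tree algorithm or the approximation algorithm the paper describes.

```python
from itertools import combinations


def is_k_ve_dominating(adj, candidate, k):
    """True if every edge uv has |(N[u] ∪ N[v]) ∩ D| >= k."""
    D = set(candidate)
    for u in adj:
        for v in adj[u]:
            seen = ({u, v} | set(adj[u]) | set(adj[v])) & D
            if len(seen) < k:
                return False
    return True


def min_k_ve_dominating_set(adj, k):
    """Smallest k-vertex-edge dominating set by brute force, or None.

    Tries subsets in order of increasing size, so the first hit is minimum.
    Only practical for very small graphs.
    """
    vertices = sorted(adj)
    for size in range(len(vertices) + 1):
        for cand in combinations(vertices, size):
            if is_k_ve_dominating(adj, cand, k):
                return set(cand)
    return None  # no subset works, e.g. when k exceeds the vertex count


# Path on four vertices: 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(min_k_ve_dominating_set(path, 1))  # {1}
print(min_k_ve_dominating_set(path, 2))  # {1, 2}
```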

arXiv:2310.06279 [pdf, other]

MEC-Intelligent Agent Support for Low-Latency Data Plane in Private NextG Core

Authors: Shalini Choudhury, Sushovan Das, Sanjoy Paul, Prasanthi Maddala, Ivan Seskar, Dipankar Raychaudhuri

Abstract: Private 5G networks will soon be ubiquitous across the future-generation smart wireless access infrastructures hosting a wide range of performance-critical applications. A high-performing User Plane Function (UPF) in the data plane is critical to achieving such stringent performance goals, as it governs fast packet processing and supports several key control-plane operations. Based on a private 5G prototype implementation and analysis, it is imperative to perform dynamic resource management and orchestration at the UPF. This paper leverages Mobile Edge Cloud-Intelligent Agent (MEC-IA), a logically centralized entity that proactively distributes resources at UPF for various service types, significantly reducing the tail latency experienced by the user requests while maximizing resource utilization. Extending the MEC-IA functionality to MEC layers further incurs data plane latency reduction. Based on our extensive simulations, under skewed uRLLC traffic arrival, the MEC-IA assisted bestfit UPF-MEC scheme reduces the worst-case latency of UE requests by up to 77.8% w.r.t. baseline. Additionally, the system can increase uRLLC connectivity gain by 2.40x while obtaining 40% CapEx savings.

Submitted 9 October, 2023; originally announced October 2023.

arXiv:2309.14389 [pdf, other]

Analyzing the Efficacy of an LLM-Only Approach for Image-based Document Question Answering

Authors: Nidhi Hegde, Sujoy Paul, Gagan Madan, Gaurav Aggarwal

Abstract: Recent document question answering models consist of two key components: the vision encoder, which captures layout and visual elements in images, and a Large Language Model (LLM) that helps contextualize questions to the image and supplements them with external world knowledge to generate accurate answers. However, the relative contributions of the vision encoder and the language model in these tasks remain unclear. This is especially interesting given the effectiveness of instruction-tuned LLMs, which exhibit remarkable adaptability to new tasks. To this end, we explore the following aspects in this work: (1) The efficacy of an LLM-only approach on document question answering tasks (2) strategies for serializing textual information within document images and feeding it directly to an instruction-tuned LLM, thus bypassing the need for an explicit vision encoder (3) thorough quantitative analysis on the feasibility of such an approach. Our comprehensive analysis encompasses six diverse benchmark datasets, utilizing LLMs of varying scales. Our findings reveal that a strategy exclusively reliant on the LLM yields results that are on par with or closely approach state-of-the-art performance across a range of datasets. We posit that this evaluation framework will serve as a guiding resource for selecting appropriate datasets for future research endeavors that emphasize the fundamental importance of layout and image content information.

Submitted 25 September, 2023; originally announced September 2023.

arXiv:2308.15037 [pdf, other]

Is it an i or an l: Test-time Adaptation of Text Line Recognition Models

Authors: Debapriya Tula, Sujoy Paul, Gagan Madan, Peter Garst, Reeve Ingle, Gaurav Aggarwal

Abstract: Recognizing text lines from images is a challenging problem, especially for handwritten documents due to large variations in writing styles. While text line recognition models are generally trained on large corpora of real and synthetic data, such models can still make frequent mistakes if the handwriting is inscrutable or the image acquisition process adds corruptions, such as noise, blur, compression, etc. Writing style is generally quite consistent for an individual, which can be leveraged to correct mistakes made by such models. Motivated by this, we introduce the problem of adapting text line recognition models during test time. We focus on a challenging and realistic setting where, given only a single test image consisting of multiple text lines, the task is to adapt the model such that it performs better on the image, without any labels. We propose an iterative self-training approach that uses feedback from the language model to update the optical model, with confident self-labels in each iteration. The confidence measure is based on an augmentation mechanism that evaluates the divergence of the prediction of the model in a local region. We perform rigorous evaluation of our method on several benchmark datasets as well as their corrupted versions. Experimental results on multiple datasets spanning multiple scripts show that the proposed adaptation method offers an absolute improvement of up to 8% in character error rate with just a few iterations of self-training at test time.

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2308.09075 [pdf, other]

Fast Decision Support for Air Traffic Management at Urban Air Mobility Vertiports using Graph Learning

Authors: Prajit KrisshnaKumar, Jhoel Witter, Steve Paul, Hanvit Cho, Karthik Dantu, Souma Chowdhury

Abstract: Urban Air Mobility (UAM) promises a new dimension to decongested, safe, and fast travel in urban and suburban hubs. These UAM aircraft are conceived to operate from small airports called vertiports each comprising multiple take-off/landing and battery-recharging spots. Since they might be situated in dense urban areas and need to handle many aircraft landings and take-offs each hour, managing this schedule in real-time becomes challenging for a traditional air-traffic controller but instead calls for an automated solution. This paper provides a novel approach to this problem of Urban Air Mobility - Vertiport Schedule Management (UAM-VSM), which leverages graph reinforcement learning to generate decision-support policies. Here the designated physical spots within the vertiport's airspace and the vehicles being managed are represented as two separate graphs, with feature extraction performed through a graph convolutional network (GCN). Extracted features are passed onto perceptron layers to decide actions such as continue to hover or cruise, continue idling or take-off, or land on an allocated vertiport spot. Performance is measured based on delays, safety (no. of collisions) and battery consumption. Through realistic simulations in AirSim applied to scaled down multi-rotor vehicles, our results demonstrate the suitability of using graph reinforcement learning to solve the UAM-VSM problem and its superiority to basic reinforcement learning (with graph embeddings) or random choice baselines.

Submitted 17 August, 2023; originally announced August 2023.

Comments: Accepted for presentation in proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems 2023

arXiv:2308.02825 [pdf, other]

Burning a binary tree and its generalization

Authors: Sandip Das, Sk Samim Islam, Ritam M Mitra, Sanchita Paul

Abstract: Graph burning is a graph process that models the spread of social contagion. Initially, all the vertices of a graph $G$ are unburnt. At each step, an unburnt vertex is put on fire and the fire from burnt vertices of the previous step spreads to their adjacent unburnt vertices. This process continues till all the vertices are burnt. The burning number $b(G)$ of the graph $G$ is the minimum number of steps required to burn all the vertices in the graph. The burning number conjecture by Bonato et al. states that for a connected graph $G$ of order $n$, its burning number $b(G) \leq \lceil \sqrt{n} \rceil$. It is easy to observe that in order to burn a graph it is enough to burn its spanning tree. Hence it suffices to prove that for any tree $T$ of order $n$, its burning number $b(T) \leq \lceil \sqrt{n} \rceil$ where $T$ is the spanning tree of $G$. It was proved in 2018 that $b(T) \leq \lceil \sqrt{n + n_2 + 1/4} +1/2 \rceil$ for a tree $T$ where $n_2$ is the number of degree $2$ vertices in $T$. In this paper, we provide an algorithm to burn a tree and we improve the existing bound using this algorithm. We prove that $b(T)\leq \lceil \sqrt{n + n_2 + 8}\rceil -1$ which is an improved bound for $n\geq 50$. We also provide an algorithm to burn some subclasses of the binary tree and prove the burning number conjecture for the same.

Submitted 14 November, 2023; v1 submitted 5 August, 2023; originally announced August 2023.
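
The three bounds quoted in the abstract above are easy to compare numerically. The sketch below evaluates them with exact integer arithmetic; the function names are hypothetical, and the code only evaluates the formulas as stated, it does not burn a tree or implement the paper's algorithm.

```python
from math import isqrt


def ceil_sqrt(x):
    """Exact ceil(sqrt(x)) for a non-negative integer x."""
    return 0 if x == 0 else isqrt(x - 1) + 1


def conjectured_bound(n):
    """Burning number conjecture: ceil(sqrt(n))."""
    return ceil_sqrt(n)


def bound_2018(n, n2):
    """ceil(sqrt(n + n2 + 1/4) + 1/2), evaluated without floats.

    With m = n + n2, sqrt(m + 1/4) + 1/2 = (sqrt(4m + 1) + 1) / 2, and the
    smallest integer t with 2t - 1 >= sqrt(4m + 1) is (ceil_sqrt(4m+1) + 2) // 2.
    """
    return (ceil_sqrt(4 * (n + n2) + 1) + 2) // 2


def bound_new(n, n2):
    """ceil(sqrt(n + n2 + 8)) - 1, the bound stated in the abstract."""
    return ceil_sqrt(n + n2 + 8) - 1


# A tree on 50 vertices with no degree-2 vertices (n2 = 0).
print(conjectured_bound(50), bound_2018(50, 0), bound_new(50, 0))  # 8 8 7
# A path on 100 vertices has n2 = 98 vertices of degree 2.
print(conjectured_bound(100), bound_2018(100, 98), bound_new(100, 98))  # 10 15 14
```

For the path, both tree bounds are well above the conjectured value because every internal vertex has degree 2, so the $n_2$ term nearly doubles the quantity under the square root.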

arXiv:2306.12213 [pdf, ps, other]

Limits for Learning with Language Models

Authors: Nicholas Asher, Swarnadeep Bhar, Akshay Chaturvedi, Julie Hunter, Soumya Paul

Abstract: With the advent of large language models (LLMs), the trend in NLP has been to train LLMs on vast amounts of data to solve diverse language understanding and generation tasks. The list of LLM successes is long and varied. Nevertheless, several recent papers provide empirical evidence that LLMs fail to capture important aspects of linguistic meaning. Focusing on universal quantification, we provide a theoretical foundation for these empirical findings by proving that LLMs cannot learn certain fundamental semantic properties including semantic entailment and consistency as they are defined in formal semantics. More generally, we show that LLMs are unable to learn concepts beyond the first level of the Borel Hierarchy, which imposes severe limits on the ability of LMs, both large and small, to capture many aspects of linguistic meaning. This means that LLMs will continue to operate without formal guarantees on tasks that require entailments and deep linguistic understanding.

Submitted 21 June, 2023; originally announced June 2023.

arXiv:2306.06823 [pdf, other]

Weakly supervised information extraction from inscrutable handwritten document images

Authors: Sujoy Paul, Gagan Madan, Akankshya Mishra, Narayan Hegde, Pradeep Kumar, Gaurav Aggarwal

Abstract: State-of-the-art information extraction methods are limited by OCR errors. They work well for printed text in form-like documents, but unstructured, handwritten documents still remain a challenge. Adapting existing models to domain-specific training data is quite expensive, because of two factors, 1) limited availability of the domain-specific documents (such as handwritten prescriptions, lab notes, etc.), and 2) annotations become even more challenging as one needs domain-specific knowledge to decode inscrutable handwritten document images. In this work, we focus on the complex problem of extracting medicine names from handwritten prescriptions using only weakly labeled data. The data consists of images along with the list of medicine names in it, but not their location in the image. We solve the problem by first identifying the regions of interest, i.e., medicine lines from just weak labels and then injecting a domain-specific medicine language model learned using only synthetically generated data. Compared to off-the-shelf state-of-the-art methods, our approach performs >2.5x better in medicine names extraction from prescriptions.

Submitted 11 June, 2023; originally announced June 2023.

Comments: Accepted at ICDAR 2023

arXiv:2306.05243 [pdf, ps, other]

Analysis of Knuth's Sampling Algorithm D and D'

Authors: Mridul Nandi, Soumit Paul

Abstract: In this research paper, we address the Distinct Elements estimation problem in the context of streaming algorithms. The problem involves estimating the number of distinct elements in a given data stream $\mathcal{A} = (a_1, a_2,\ldots, a_m)$, where $a_i \in \{1, 2, \ldots, n\}$. Over the past four decades, the Distinct Elements problem has received considerable attention, theoretically and empirically, leading to the development of space-optimal algorithms. A recent sampling-based algorithm proposed by Chakraborty et al. [11] has garnered significant interest and has even attracted the attention of renowned computer scientist Donald E. Knuth, who wrote an article on the same topic [6] and called the algorithm CVM. In this paper, we thoroughly examine the algorithms (referred to as CVM1, CVM2 in [11] and DonD, DonD' in [6]). We first unify all these algorithms and call them cutoff-based algorithms. Then we provide an approximation and biasedness analysis of these algorithms.

Submitted 11 June, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: We have provided an unbiased analysis (using exactly the same idea as the previous version) for the continuous score distribution instead of the discrete version

MSC Class: F.2.0;

arXiv:2306.04047 [pdf, other]

CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

Authors: Xiulong Liu, Sudipta Paul, Moitreya Chatterjee, Anoop Cherian

Abstract: Audio-visual navigation of an agent towards locating an audio goal is a challenging task especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpret free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large scale dataset: AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that only use uni-directional interaction.

Submitted 26 December, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: Accepted at AAAI 2024

arXiv:2306.03542 [pdf, other]

Masked Autoencoders are Efficient Continual Federated Learners

Authors: Subarnaduti Paul, Lars-Joel Frey, Roshni Kamath, Kristian Kersting, Martin Mundt

Abstract: Machine learning is typically framed from a perspective of i.i.d., and more importantly, isolated data. In parts, federated learning lifts this assumption, as it sets out to solve the real-world challenge of collaboratively learning a shared model from data distributed across clients. However, motivated primarily by privacy and computational constraints, the fact that data may change, distributions drift, or even tasks advance individually on clients, is seldom taken into account. The field of continual learning addresses this separate challenge and first steps have recently been taken to leverage synergies in distributed supervised settings, in which several clients learn to solve changing classification tasks over time without forgetting previously seen ones. Motivated by these prior works, we posit that such federated continual learning should be grounded in unsupervised learning of representations that are shared across clients; in the loose spirit of how humans can indirectly leverage others' experience without exposure to a specific task. For this purpose, we demonstrate that masked autoencoders for distribution estimation are particularly amenable to this setup. Specifically, their masking strategy can be seamlessly integrated with task attention mechanisms to enable selective knowledge transfer between clients. We empirically corroborate the latter statement through several continual federated scenarios on both image and binary datasets.

Submitted 18 July, 2024; v1 submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.01442 [pdf, ps, other]

A Direct Construction of Optimal Symmetrical Z-Complementary Code Sets of Prime Power Lengths

Authors: Praveen Kumar, Sudhan Majhi, Subhabrata Paul

Abstract: This paper presents a direct construction of an optimal symmetrical Z-complementary code set (SZCCS) of prime power lengths using a multi-variable function (MVF). SZCCS is a natural extension of the Z-complementary code set (ZCCS), which has only front-end zero correlation zone (ZCZ) width. SZCCS has both front-end and tail-end ZCZ width. SZCCSs are used in developing optimal training sequences for broadband generalized spatial modulation systems over frequency-selective channels because they have ZCZ width on both the front and tail ends. The construction of optimal SZCCS with large set sizes and prime power lengths is presented for the first time in this paper. Furthermore, it is worth noting that several existing works on ZCCS and SZCCS can be viewed as special cases of the proposed construction.

Submitted 2 May, 2023; originally announced May 2023.

arXiv:2304.14604 [pdf, other]

doi 10.1016/j.cam.2024.115782

Deep Neural-network Prior for Orbit Recovery from Method of Moments

Authors: Yuehaw Khoo, Sounak Paul, Nir Sharon

Abstract: Orbit recovery problems are a class of problems that often arise in practice and in various forms. In these problems, we aim to estimate an unknown function after being distorted by a group action and observed via a known operator. Typically, the observations are contaminated with a non-trivial level of noise. Two particular orbit recovery problems of interest in this paper are multireference alignment and single-particle cryo-EM modelling. In order to suppress the noise, we suggest using the method of moments approach for both problems while introducing deep neural network priors. In particular, our neural networks should output the signals and the distribution of group elements, with moments being the input. In the multireference alignment case, we demonstrate the advantage of using the NN to accelerate the convergence for the reconstruction of signals from the moments. Finally, we use our method to reconstruct simulated and biological volumes in the cryo-EM setting.

Submitted 30 January, 2024; v1 submitted 27 April, 2023; originally announced April 2023.

Journal ref: J. Comput. Appl. Math. 115782 (2024)

arXiv:2303.08954 [pdf, other]

PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs

Authors: Rahul Goel, Waleed Ammar, Aditya Gupta, Siddharth Vashishtha, Motoki Sano, Faiz Surani, Max Chang, HyunJeong Choe, David Greene, Kyle He, Rattima Nitisaroj, Anna Trukhina, Shachi Paul, Pararth Shah, Rushin Shah, Zhou Yu

Abstract: Research interest in task-oriented dialogs has increased as systems such as Google Assistant, Alexa and Siri have become ubiquitous in everyday life. However, the impact of academic research in this area has been limited by the lack of datasets that realistically capture the wide array of user pain points. To enable research on some of the more challenging aspects of parsing realistic conversations, we introduce PRESTO, a public dataset of over 550K contextual multilingual conversations between humans and virtual assistants. PRESTO contains a diverse array of challenges that occur in real-world NLU tasks such as disfluencies, code-switching, and revisions. It is the only large scale human generated conversational parsing dataset that provides structured context such as a user's contacts and lists for each example. Our mT5 model based baselines demonstrate that the conversational phenomena present in PRESTO are challenging to model, which is further pronounced in a low-resource setup.

Submitted 16 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: PRESTO v1 Release

arXiv:2303.08933 [pdf, other]

Efficient Planning of Multi-Robot Collective Transport using Graph Reinforcement Learning with Higher Order Topological Abstraction

Authors: Steve Paul, Wenyuan Li, Brian Smyth, Yuzhou Chen, Yulia Gel, Souma Chowdhury

Abstract: Efficient multi-robot task allocation (MRTA) is fundamental to various time-sensitive applications such as disaster response, warehouse operations, and construction. This paper tackles a particular class of these problems that we call MRTA-collective transport or MRTA-CT -- here tasks present varying workloads and deadlines, and robots are subject to flight range, communication range, and payload constraints. For large instances of these problems involving 100s to 1000s of tasks and 10s to 100s of robots, traditional non-learning solvers are often time-inefficient, and emerging learning-based policies do not scale well to larger-sized problems without costly retraining. To address this gap, we use a recently proposed encoder-decoder graph neural network involving Capsule networks and a multi-head attention mechanism, and innovatively add topological descriptors (TD) as new features to improve transferability to unseen problems of similar and larger size. Persistent homology is used to derive the TD, and proximal policy optimization is used to train our TD-augmented graph neural network. The resulting policy model compares favorably to state-of-the-art non-learning baselines while being much faster. The benefit of using TD is readily evident when scaling to test problems of size larger than those used in training.

Submitted 17 August, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: This paper has been accepted to be presented at the IEEE International Conference on Robotics and Automation, 2023

arXiv:2303.01243 [pdf, other]

doi 10.1145/3572864.3581586

Poster: Sponge ML Model Attacks of Mobile Apps

Authors: Souvik Paul, Nicolas Kourtellis

Abstract: Machine Learning (ML)-powered apps are used in pervasive devices such as phones, tablets, smartwatches and IoT devices. Recent advances in collaborative, distributed ML such as Federated Learning (FL) attempt to solve privacy concerns of users and data owners, and are thus used by tech industry leaders such as Google, Facebook and Apple. However, FL systems and models are still vulnerable to adversarial membership and attribute inferences and model poisoning attacks, especially in recently proposed FL-as-a-Service ecosystems, which can enable attackers to access multiple ML-powered apps. In this work, we focus on the recently proposed Sponge attack: it is designed to soak up energy consumed while executing inference (not training) of an ML model, without hampering the classifier's performance. Recent work has shown that sponge attacks on ASIC-enabled GPUs can potentially escalate the power consumption and inference time. For the first time, in this work, we investigate this attack in the mobile setting and measure the effect it can have on ML models running inside apps on mobile devices.

Submitted 1 March, 2023; originally announced March 2023.

Comments: 2 pages, 6 figures. Proceedings of the 24th International Workshop on Mobile Computing Systems and Applications (HotMobile). Feb. 2023

MSC Class: 68M25; 68P27; 68Txx ACM Class: I.2.11

arXiv:2302.05849 [pdf, other]

Graph Learning Based Decision Support for Multi-Aircraft Take-Off and Landing at Urban Air Mobility Vertiports

Authors: Prajit KrisshnaKumar, Jhoel Witter, Steve Paul, Karthik Dantu, Souma Chowdhury

Abstract: The majority of aircraft under the Urban Air Mobility (UAM) concept are expected to be of the electric vertical takeoff and landing (eVTOL) vehicle type, which will operate out of vertiports. While this is akin to the relationship between general aviation aircraft and airports, the conceived location of vertiports within dense urban environments presents unique challenges in managing the air traffic served by a vertiport. This challenge becomes pronounced with increasing frequency of scheduled landings and take-offs. This paper assumes a centralized air traffic controller (ATC) to explore the performance of a new AI-driven ATC approach to manage the eVTOLs served by the vertiport. Minimum separation-driven safety and delays are the two important considerations in this case. The ATC problem is modeled as a task allocation problem, and uncertainties due to communication disruptions (e.g., poor link quality) and inclement weather (e.g., high gust effects) are added as a small probability of action failures. To learn the vertiport ATC policy, a novel graph-based reinforcement learning (RL) solution called "Urban Air Mobility - Vertiport Schedule Management (UAM-VSM)" is developed. This approach uses graph convolutional networks (GCNs) to abstract the vertiport space and eVTOL space as graphs, and aggregate information for a centralized ATC agent to help generalize the environment. Unreal Engine combined with AirSim is used as the simulation environment over which training and testing occur.
Uncertainties are considered only during testing, due to the high cost of Monte Carlo (MC) sampling over such realistic simulations. The proposed graph RL method demonstrates significantly better performance on the test scenarios when compared against a feasible random decision-making baseline and a first come first serve (FCFS) baseline, including the ability to generalize to unseen scenarios and with uncertainties.

Submitted 11 February, 2023; originally announced February 2023.

Comments: Presented at AIAA Scitech Forum 2022

Showing 1–50 of 216 results for author: Paul, S