-
The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation
Authors:
Samee Arif,
Sualeha Farid,
Abdul Hameed Azeemi,
Awais Athar,
Agha Ali Raza
Abstract:
This paper presents synthetic Preference Optimization (PO) datasets generated using multi-agent workflows and evaluates the effectiveness and potential of these workflows in the dataset generation process. PO dataset generation requires two modules: (1) response evaluation, and (2) response generation. In the response evaluation module, the responses from Large Language Models (LLMs) are evaluated and ranked, a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a two-step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. In each step, we measure inter-rater agreement between human annotators and LLMs using Cohen's Kappa. For the response generation module, we compare different configurations for the LLM Feedback Loop using the identified LLM evaluator configuration. We use the win rate (the fraction of times a generation framework is selected as the best by an LLM evaluator) to determine the best multi-agent configuration for generation. After identifying the best configurations for both modules, we use models from the GPT, Gemma, and Llama families to generate our PO datasets using the above pipeline. We generate two types of PO datasets, one to improve the generation capabilities of individual LLMs and the other to improve the multi-agent workflow. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across datasets when the candidate responses do not include responses from the GPT family. Additionally, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves a notable 71.8% and 73.8% win rate over single-agent Llama and Gemma, respectively.
Submitted 24 August, 2024; v1 submitted 16 August, 2024;
originally announced August 2024.
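The two metrics named in the abstract above, Cohen's Kappa and win rate, can be illustrated with a minimal sketch. The function names, the label values, and the toy data are assumptions made for illustration; this is not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treated as perfect
        # agreement here (a convention chosen for this sketch).
        return 1.0
    return (observed - expected) / (1.0 - expected)


def win_rate(winners, framework):
    """Fraction of comparisons in which `framework` was picked as best."""
    return sum(w == framework for w in winners) / len(winners)


# Hypothetical toy data: which of two responses ("A" or "B") each rater prefers.
human = ["A", "B", "A", "A", "B", "A"]
judge = ["A", "B", "B", "A", "B", "A"]
kappa = cohens_kappa(human, judge)

# Hypothetical evaluator picks between a feedback loop and a single agent.
picks = ["loop", "single", "loop", "loop"]
loop_win_rate = win_rate(picks, "loop")
```

In this toy case the raters agree on 5 of 6 items while chance agreement is 0.5, so kappa is (5/6 - 1/2) / (1 - 1/2) = 2/3, and the loop wins 3 of 4 comparisons.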
-
Generalists vs. Specialists: Evaluating Large Language Models for Urdu
Authors:
Samee Arif,
Abdul Hameed Azeemi,
Agha Ali Raza,
Awais Athar
Abstract:
In this paper, we compare general-purpose pretrained models, GPT-4-Turbo and Llama-3-8b-Instruct, with special-purpose models fine-tuned on specific tasks, XLM-Roberta-large, mT5-large, and Llama-3-8b-Instruct. We focus on seven classification and six generation tasks to evaluate the performance of these models on the Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite the frequent advancements in Large Language Models (LLMs), their performance in low-resource languages, including Urdu, still needs to be explored. We also conduct a human evaluation for the generation tasks and compare the results with the evaluations performed by GPT-4-Turbo and Llama-3-8b-Instruct. We find that special-purpose models consistently outperform general-purpose models across various tasks. We also find that the evaluation done by GPT-4-Turbo for generation tasks aligns more closely with human evaluation compared to the evaluation by Llama-3-8b-Instruct. This paper contributes to the NLP community by providing insights into the effectiveness of general and specific-purpose LLMs for low-resource languages.
Submitted 5 July, 2024;
originally announced July 2024.
-
UQA: Corpus for Urdu Question Answering
Authors:
Samee Arif,
Sualeha Farid,
Awais Athar,
Agha Ali Raza
Abstract:
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model among two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results. For XLM-RoBERTa-XL, we have an F1 score of 85.99 and 74.56 EM. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at www.github.com/sameearif/UQA.
Submitted 22 July, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
Leak Proof CMap; a framework for training and evaluation of cell line agnostic L1000 similarity methods
Authors:
Steven Shave,
Richard Kasprowicz,
Abdullah M. Athar,
Denise Vlachou,
Neil O. Carragher,
Cuong Q. Nguyen
Abstract:
The Connectivity Map (CMap) is a large publicly available database of cellular transcriptomic responses to chemical and genetic perturbations built using a standardized acquisition protocol known as the L1000 technique. Databases such as CMap provide an exciting opportunity to enrich drug discovery efforts, providing a 'known' phenotypic landscape to explore and enabling the development of state of the art techniques for enhanced information extraction and better informed decisions. Whilst multiple methods for measuring phenotypic similarity and interrogating profiles have been developed, the field is severely lacking standardized benchmarks using appropriate data splitting for training and unbiased evaluation of machine learning methods. To address this, we have developed 'Leak Proof CMap' and exemplified its application to a set of common transcriptomic and generic phenotypic similarity methods along with an exemplar triplet loss-based method. Benchmarking in three critical performance areas (compactness, distinctness, and uniqueness) is conducted using carefully crafted data splits ensuring no similar cell lines or treatments with shared or closely matching responses or mechanisms of action are present in training, validation, or test sets. This enables testing of models with unseen samples akin to exploring treatments with novel modes of action in novel patient derived cell lines. With a carefully crafted benchmark and data splitting regime in place, the tooling now exists to create performant phenotypic similarity methods for use in personalized medicine (novel cell lines) and to better augment high throughput phenotypic screening technologies with the L1000 transcriptomic technology.
Submitted 29 April, 2024;
originally announced April 2024.
-
4D-Former: Multimodal 4D Panoptic Segmentation
Authors:
Ali Athar,
Enxu Li,
Sergio Casas,
Raquel Urtasun
Abstract:
4D panoptic segmentation is a challenging but practically useful task that requires every point in a LiDAR point-cloud sequence to be assigned a semantic class label, and individual objects to be segmented and tracked over time. Existing approaches utilize only LiDAR inputs which convey limited information in regions with point sparsity. This problem can, however, be mitigated by utilizing RGB camera images which offer appearance-based information that can reinforce the geometry-based LiDAR features. Motivated by this, we propose 4D-Former: a novel method for 4D panoptic segmentation which leverages both LiDAR and image modalities, and predicts semantic masks as well as temporally consistent object masks for the input point-cloud sequence. We encode semantic classes and objects using a set of concise queries which absorb feature information from both data modalities. Additionally, we propose a learned mechanism to associate object tracks over time which reasons over both appearance and spatial location. We apply 4D-Former to the nuScenes and SemanticKITTI datasets where it achieves state-of-the-art results.
Submitted 17 November, 2023; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Effects of $f (R, G)$ gravity on anisotropic charged compact objects
Authors:
M. Ilyas,
A. R. Athar,
F. Khan,
Asma Anfal
Abstract:
The present study provides an in-depth analysis of the anisotropic matter distribution and various physical aspects of compact stars in the context of a $f(R,G)$-gravity framework. In order to gain an exhaustive understanding of these aspects, our study focuses on three particular compact stars: VELA X-1 (CS1), SAXJ1808.4-3658 (CS2), and 4U1820-30 (CS3). We conducted calculations on the relevant characteristics of these compact stars by employing three different models of $f(R,G)$-gravity. As a convenient approach, the $f(R,G)$-gravity is organized into two distinct components, which include $f_1(R)$ and $f_2(G)$. The $R$ dependent component is modeled similarly to the Hu-Sawicki approach, while for modeling the $G$ dependent component, we chose logarithmic and power law-like approaches and suggested three viable gravity models. Graphical methods are used to analyze the physical properties of the compact stars in the domain of suggested models of gravity.
Submitted 21 August, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
TarViS: A Unified Approach for Target-based Video Segmentation
Authors:
Ali Athar,
Alexander Hermans,
Jonathon Luiten,
Deva Ramanan,
Bastian Leibe
Abstract:
The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS
Submitted 10 May, 2023; v1 submitted 6 January, 2023;
originally announced January 2023.
-
BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video
Authors:
Ali Athar,
Jonathon Luiten,
Paul Voigtlaender,
Tarasha Khurana,
Achal Dave,
Bastian Leibe,
Deva Ramanan
Abstract:
Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g. J&F, mAP, sMOTSA). As a result, published works usually target a particular benchmark, and are not easily comparable to each other. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset which contains thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison, and hence, more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference. Dataset annotations and evaluation code are available at: https://github.com/Ali2500/BURST-benchmark.
Submitted 22 November, 2022; v1 submitted 24 September, 2022;
originally announced September 2022.
-
Some Specific Wormhole Solutions in Extended $f(R,G,T)$ Gravity
Authors:
M. Ilyas,
A. R. Athar,
Fawad Khan,
Nasreen Ghafoor,
Haifa I. Alrebdi,
Kottakkaran Sooppy Nisar,
Abdel-Haleem Abdel-Aty
Abstract:
This research work provides an exhaustive investigation of the viability of different coupled wormhole (WH) geometries with the relativistic matter configurations in the $f(R,G,T)$ extended gravity framework. We consider a specific model in the context of $f(R,G,T)$-gravity for this purpose. Also, we assume a static spherically symmetric space-time geometry and a unique distribution of matter with a set of shape functions ($β(r)$) for analyzing different energy conditions (ECs). In addition to this, we examine WH models in the equilibrium scenario by employing anisotropic fluid. The corresponding results are obtained using numerical methods and then presented using different plots. In this case, $f(R,G,T)$ gravity generates additional curvature quantities, which can be thought of as gravitational objects that maintain irregular WH situations. Based on our findings, we conclude that, in the absence of exotic matter, WHs can exist in some specific regions of the parametric space using the modified gravity model $f(R,G,T) = R +αR^2+βG^n+γG\ln(G)+λT$.
Submitted 12 May, 2023; v1 submitted 1 July, 2022;
originally announced July 2022.
-
Differentiable Soft-Masked Attention
Authors:
Ali Athar,
Jonathon Luiten,
Alexander Hermans,
Deva Ramanan,
Bastian Leibe
Abstract:
Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the 'cross-attention' operation, which allows a vector representation (e.g. of an object in an image) to be learned by attending to an arbitrarily sized set of input features. Recently, "Masked Attention" was proposed in which a given object representation only attends to those image pixel features for which the segmentation mask of that object is active. This specialization of attention proved beneficial for various image and video segmentation tasks. In this paper, we propose another specialization of attention which enables attending over `soft-masks' (those with continuous mask probabilities instead of binary values), and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision. This can be useful for several applications. Specifically, we employ our "Differentiable Soft-Masked Attention" for the task of Weakly-Supervised Video Object Segmentation (VOS), where we develop a transformer-based network for VOS which only requires a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame. Although there is no loss for masks in unlabeled frames, the network is still able to segment objects in those frames due to our novel attention formulation. Code: https://github.com/Ali2500/HODOR/blob/main/hodor/modelling/encoder/soft_masked_attention.py
Submitted 5 August, 2022; v1 submitted 31 May, 2022;
originally announced June 2022.
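To make the idea of attending over soft masks concrete, here is a minimal single-query sketch. It assumes one plausible realization, adding the logarithm of each mask probability to the attention logits before the softmax; the function name, the `eps` constant, and the toy inputs are all hypothetical, and the sketch is not taken from the linked repository.

```python
import math


def soft_masked_attention(query, keys, values, mask_probs, eps=1e-9):
    """Single-query attention whose weights are modulated by a soft mask.

    Assumption for this sketch: the mask enters as log(mask + eps) added to
    the scaled dot-product logits, so a probability near 0 suppresses a key
    and a probability of 1 leaves its logit essentially unchanged.
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + math.log(m + eps)
        for key, m in zip(keys, mask_probs)
    ]
    # Numerically stable softmax over the masked logits.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return output, weights


# Toy inputs: three identical keys, so only the mask differentiates them.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]

_, uniform = soft_masked_attention(query, keys, values, [1.0, 1.0, 1.0])
_, masked = soft_masked_attention(query, keys, values, [1.0, 0.0, 1.0])
```

Because every operation here is a smooth function of the mask probabilities (for probabilities above zero), an automatic-differentiation framework could propagate gradients back to the mask, which is the property the abstract emphasizes.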
-
Charged Compact Stars in Extended $f(\mathcal{R},\mathcal{G},\mathcal{T})$ Gravity
Authors:
M. Ilyas,
A. R. Athar,
Asma Bibi
Abstract:
The purpose of this paper is to study charged compact stars using extended gravitational theory, also known as $f(\mathcal{R}, \mathcal{G}, \mathcal{T})$ gravity. Alternatively, this theory is also called $f(\mathcal{R}, \mathcal{T}, \mathcal{G})$ gravity. The symbols $\mathcal{R}, \mathcal{G}$, and $\mathcal{T}$ denote the Ricci Scalar, the Gauss-Bonnet invariant, and the trace of the energy-momentum tensor, respectively. We suggested several plausible models in the framework of this new gravity theory, and then used these models to explore several physical properties of compact objects of relativistic nature. This research also takes into account three famous compact stars: Vela X-1 (CS1); SAXJ1808.4-3658 (CS2); and 4U1820-30 (CS3). Moreover, using the suggested models, the physical nature of anisotropic stress, energy density, various energy conditions (ECs), the state of equilibrium, interior stability, mass variations, compactness, anisotropy, electric charge, and electric field intensity are analysed for considered compact stars. Different plots of the above-mentioned quantities are presented for this analysis. Conclusively, the ECs are satisfied, and the compact stars have a significant dense core.
Submitted 27 April, 2023; v1 submitted 29 May, 2022;
originally announced May 2022.
-
HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images
Authors:
Ali Athar,
Jonathon Luiten,
Alexander Hermans,
Deva Ramanan,
Bastian Leibe
Abstract:
Existing state-of-the-art methods for Video Object Segmentation (VOS) learn low-level pixel-to-pixel correspondences between frames to propagate object masks across video. This requires a large amount of densely annotated video data, which is costly to annotate, and largely redundant since frames within a video are highly correlated. In light of this, we propose HODOR: a novel method that tackles VOS by effectively leveraging annotated static images for understanding object appearance and scene context. We encode object instances and scene information from an image frame into robust high-level descriptors which can then be used to re-segment those objects in different frames. As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks compared to existing methods trained without video annotations. Without any architectural modification, HODOR can also learn from video context around single annotated video frames by utilizing cyclic consistency, whereas other methods rely on dense, temporally consistent annotations. Source code is available at: https://github.com/Ali2500/HODOR
Submitted 15 July, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
D^2Conv3D: Dynamic Dilated Convolutions for Object Segmentation in Videos
Authors:
Christian Schmidt,
Ali Athar,
Sabarinath Mahadevan,
Bastian Leibe
Abstract:
Despite receiving significant attention from the research community, the task of segmenting and tracking objects in monocular videos still has much room for improvement. Existing works have simultaneously justified the efficacy of dilated and deformable convolutions for various image-level segmentation tasks. This gives reason to believe that 3D extensions of such convolutions should also yield performance improvements for video-level segmentation tasks. However, this aspect has not yet been explored thoroughly in existing literature. In this paper, we propose Dynamic Dilated Convolutions (D^2Conv3D): a novel type of convolution which draws inspiration from dilated and deformable convolutions and extends them to the 3D (spatio-temporal) domain. We experimentally show that D^2Conv3D can be used to improve the performance of multiple 3D CNN architectures across multiple video segmentation related benchmarks by simply employing D^2Conv3D as a drop-in replacement for standard convolutions. We further show that D^2Conv3D out-performs trivial extensions of existing dilated and deformable convolutions to 3D. Lastly, we set a new state-of-the-art on the DAVIS 2016 Unsupervised Video Object Segmentation benchmark. Code is made publicly available at https://github.com/Schmiddo/d2conv3d .
Submitted 15 November, 2021;
originally announced November 2021.
-
Making a Case for 3D Convolutions for Object Segmentation in Videos
Authors:
Sabarinath Mahadevan,
Ali Athar,
Aljoša Ošep,
Sebastian Hennen,
Laura Leal-Taixé,
Bastian Leibe
Abstract:
The task of object segmentation in videos is usually accomplished by processing appearance and motion information separately using standard 2D convolutional networks, followed by a learned fusion of the two sources of information. On the other hand, 3D convolutional networks have been successfully applied for video classification tasks, but have not been leveraged as effectively for problems involving dense per-pixel interpretation of videos as their 2D convolutional counterparts, and lag behind the aforementioned networks in terms of performance. In this work, we show that 3D CNNs can be effectively applied to dense video prediction tasks such as salient object segmentation. We propose a simple yet effective encoder-decoder network architecture consisting entirely of 3D convolutions that can be trained end-to-end using a standard cross-entropy loss. To this end, we leverage an efficient 3D encoder, and propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules. Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal dataset benchmarks in addition to being faster, thus showing that our architecture can efficiently learn expressive spatio-temporal features and produce high quality video segmentation masks. We have made our code and trained models publicly available at https://github.com/sabarim/3DC-Seg.
Submitted 1 September, 2023; v1 submitted 26 August, 2020;
originally announced August 2020.
-
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Authors:
Ali Athar,
Sabarinath Mahadevan,
Aljoša Ošep,
Laura Leal-Taixé,
Bastian Leibe
Abstract:
Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg.
Submitted 1 September, 2023; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Microcontroller based automated life savior -- Medisûr
Authors:
Soumick Chatterjee,
Pramod George Jose,
Priyanka Basak,
Ambreen Athar,
Bindhya Aravind,
Romit S. Beed,
Rana Biswas
Abstract:
With the progress made in the field of medicine, most patients' lives can be saved. The only requirement is proper attention at the proper time. Our wearable solution tries to address this by measuring the patient's vitals and transmitting them, along with the patient's current location, to a server for live monitoring through a mobile app. In case of an emergency, that is, if any vitals show abnormalities, an SMS is sent to the patient's caregiver with the patient's location so that the caregiver can reach the patient in time.
Submitted 4 December, 2019;
originally announced December 2019.
-
Multiple People Tracking Using Hierarchical Deep Tracklet Re-identification
Authors:
Maryam Babaee,
Ali Athar,
Gerhard Rigoll
Abstract:
The task of multiple people tracking in monocular videos is challenging because of the numerous difficulties involved: occlusions, varying environments, crowded scenes, camera parameters and motion. In the tracking-by-detection paradigm, most approaches adopt person re-identification techniques based on computing the pairwise similarity between detections. However, these techniques are less effective in handling long-term occlusions. By contrast, tracklet (a sequence of detections) re-identification can improve association accuracy since tracklets offer a richer set of visual appearance and spatio-temporal cues. In this paper, we propose a tracking framework that employs a hierarchical clustering mechanism for merging tracklets. To this end, tracklet re-identification is performed by utilizing a novel multi-stage deep network that can jointly reason about the visual appearance and spatio-temporal properties of a pair of tracklets, thereby providing a robust measure of affinity. Experimental results on the challenging MOT16 and MOT17 benchmarks show that our method significantly outperforms state-of-the-art methods.
Submitted 17 November, 2018; v1 submitted 9 November, 2018;
originally announced November 2018.
-
An Overview of Datatype Quantization Techniques for Convolutional Neural Networks
Authors:
Ali Athar
Abstract:
Convolutional Neural Networks (CNNs) are becoming increasingly popular due to their superior performance in the domain of computer vision, in applications such as object detection and recognition. However, they demand complex, power-consuming hardware, which makes them unsuitable for implementation on low-power mobile and embedded devices. This paper describes and compares various techniques that aim to mitigate this problem. This is primarily achieved by quantizing the floating-point weights and activations to reduce the hardware requirements, and adapting the training and inference algorithms to maintain the network's performance.
Submitted 22 August, 2018;
originally announced August 2018.
-
Urdu Word Segmentation using Conditional Random Fields (CRFs)
Authors:
Haris Bin Zia,
Agha Ali Raza,
Awais Athar
Abstract:
State-of-the-art Natural Language Processing algorithms rely heavily on efficient word segmentation. Urdu is among the languages for which word segmentation is a complex task, as it exhibits space omission as well as space insertion issues. This is partly due to the Arabic script, which, although cursive in nature, consists of characters that have inherent joining and non-joining attributes regardless of word boundary. This paper presents a word segmentation system for Urdu which uses a Conditional Random Field sequence modeler with orthographic, linguistic and morphological features. Our proposed model automatically learns to predict white space as a word boundary as well as the Zero Width Non-Joiner (ZWNJ) as a sub-word boundary. Using a manually annotated corpus, our model achieves an F1 score of 0.97 for word boundary identification and 0.85 for sub-word boundary identification tasks. We have made our code and corpus publicly available to make our results reproducible.
Submitted 14 June, 2018;
originally announced June 2018.
-
PronouncUR: An Urdu Pronunciation Lexicon Generator
Authors:
Haris Bin Zia,
Agha Ali Raza,
Awais Athar
Abstract:
State-of-the-art speech recognition systems rely heavily on three basic components: an acoustic model, a pronunciation lexicon and a language model. To build these components, a researcher needs linguistic as well as technical expertise, which is a barrier in low-resource domains. Techniques to construct these three components without having expert domain knowledge are in great demand. Urdu, despite having millions of speakers all over the world, is a low-resource language in terms of standard publicly available linguistic resources. In this paper, we present a grapheme-to-phoneme conversion tool for Urdu that generates a pronunciation lexicon in a form suitable for use with speech recognition systems from a list of Urdu words. The tool predicts the pronunciation of words using an LSTM-based model trained on a handcrafted expert lexicon of around 39,000 words and shows an accuracy of 64% upon internal evaluation. For external evaluation on a speech recognition task, we obtain a word error rate comparable to one achieved using a fully handcrafted expert lexicon.
Submitted 5 March, 2018; v1 submitted 1 January, 2018;
originally announced January 2018.