Bridging the Gap: Unpacking the Hidden Challenges in Knowledge Distillation for Online Ranking Systems

Authors: Nikhil Khani, Shuo Yang, Aniruddh Nath, Yang Liu, Pendo Abbo, Li Wei, Shawn Andrews, Maciej Kula, Jarrod Kahn, Zhe Zhao, Lichan Hong, Ed Chi

Abstract: Knowledge Distillation (KD) is a powerful approach for compressing a large model into a smaller, more efficient model, particularly beneficial for latency-sensitive applications like recommender systems. However, current KD research predominantly focuses on Computer Vision (CV) and NLP tasks, overlooking unique data characteristics and challenges inherent to recommender systems. This paper addresses these overlooked challenges, specifically: (1) mitigating data distribution shifts between teacher and student models, (2) efficiently identifying optimal teacher configurations within time and budgetary constraints, and (3) enabling computationally efficient and rapid sharing of teacher labels to support multiple students. We present a robust KD system developed and rigorously evaluated on multiple large-scale personalized video recommendation systems within Google. Our live experiment results demonstrate significant improvements in student model performance while ensuring consistent and reliable generation of high-quality teacher labels from a continuous stream of data.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2406.11897 [pdf, other]

A Benchmark for Maximum Cut: Towards Standardization of the Evaluation of Learned Heuristics for Combinatorial Optimization

Authors: Ankur Nath, Alan Kuhnle

Abstract: Recently, there has been much work on the design of general heuristics for graph-based, combinatorial optimization problems via the incorporation of Graph Neural Networks (GNNs) to learn distribution-specific solution structures. However, there is a lack of consistency in the evaluation of these heuristics, in terms of the baselines and instances chosen, which makes it difficult to assess the relative performance of the algorithms. In this paper, we propose an open-source benchmark suite MaxCut-Bench dedicated to the NP-hard Maximum Cut problem in both its weighted and unweighted variants, based on a careful selection of instances curated from diverse graph datasets. The suite offers a unified interface to various heuristics, both traditional and machine learning-based. Next, we use the benchmark in an attempt to systematically corroborate or reproduce the results of several popular learning-based approaches, including S2V-DQN [31], ECO-DQN [4], among others, in terms of three dimensions: objective value, generalization, and scalability. Our empirical results show that several of the learned heuristics fail to outperform a naive greedy algorithm, and that only one of them consistently outperforms Tabu Search, a simple, general heuristic based upon local search. Furthermore, we find that the performance of ECO-DQN remains the same or is improved if the GNN is replaced by a simple linear regression on a subset of the features that are related to Tabu Search. Code, data, and pretrained models are available at: https://github.com/ankurnath/MaxCut-Bench.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2405.05202 [pdf, other]

Discretely Beyond $1/e$: Guided Combinatorial Algorithms for Submodular Maximization

Authors: Yixin Chen, Ankur Nath, Chunli Peng, Alan Kuhnle

Abstract: For constrained, not necessarily monotone submodular maximization, all known approximation algorithms with ratio greater than $1/e$ require continuous ideas, such as queries to the multilinear extension of a submodular function and its gradient, which are typically expensive to simulate with the original set function. For combinatorial algorithms, the best known approximation ratios for both size and matroid constraints are obtained by a simple randomized greedy algorithm of Buchbinder et al. [9]: $1/e \approx 0.367$ for the size constraint and $0.281$ for the matroid constraint in $\mathcal O (kn)$ queries, where $k$ is the rank of the matroid. In this work, we develop the first combinatorial algorithms to break the $1/e$ barrier: we obtain an approximation ratio of $0.385$ in $\mathcal O (kn)$ queries to the submodular set function for the size constraint, and $0.305$ for a general matroid constraint. These are achieved by guiding the randomized greedy algorithm with a fast local search algorithm. Further, we develop deterministic versions of these algorithms, maintaining the same ratio and asymptotic time complexity. Finally, we develop a deterministic, nearly linear time algorithm with ratio $0.377$.

Submitted 22 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2404.08949 [pdf, other]

Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles

Authors: Abhijnan Nath, Huma Jamil, Shafiuddin Rehan Ahmed, George Baker, Rahul Ghosh, James H. Martin, Nathaniel Blanchard, Nikhil Krishnaswamy

Abstract: Event coreference resolution (ECR) is the task of determining whether distinct mentions of events within a multi-document corpus are actually linked to the same underlying occurrence. Images of the events can help facilitate resolution when language is ambiguous. Here, we propose a multimodal cross-document event coreference resolution method that integrates visual and textual cues with a simple linear map between vision and language models. As existing ECR benchmark datasets rarely provide images for all event mentions, we augment the popular ECB+ dataset with event-centric images scraped from the internet and generated using image diffusion models. We establish three methods that incorporate images and text for coreference: 1) a standard fused model with finetuning, 2) a novel linear mapping method without finetuning and 3) an ensembling approach based on splitting mention pairs by semantic and discourse-level difficulty. We evaluate on 2 datasets: the augmented ECB+, and AIDA Phase 1. Our ensemble systems using cross-modal linear mapping establish an upper limit (91.9 CoNLL F1) on ECB+ ECR performance given the preprocessing assumptions used, and establish a novel baseline on AIDA Phase 1. Our results demonstrate the utility of multimodal information in ECR for certain challenging coreference problems, and highlight a need for more multimodal resources in the coreference resolution space.

Submitted 13 April, 2024; originally announced April 2024.

Comments: To appear at LREC-COLING 2024

arXiv:2404.04299 [pdf, other]

GENEVIC: GENetic data Exploration and Visualization via Intelligent interactive Console

Authors: Anindita Nath, Savannah Mwesigwa, Yulin Dai, Xiaoqian Jiang, Zhongming Zhao

Abstract: Summary: The vast generation of genetic data poses a significant challenge in efficiently uncovering valuable knowledge. Introducing GENEVIC, an AI-driven chat framework that tackles this challenge by bridging the gap between genetic data generation and biomedical knowledge discovery. Leveraging generative AI, notably ChatGPT, it serves as a biologist's 'copilot'. It automates the analysis, retrieval, and visualization of customized domain-specific genetic information, and integrates functionalities to generate protein interaction networks, enrich gene sets, and search scientific literature from PubMed, Google Scholar, and arXiv, making it a comprehensive tool for biomedical research. In its pilot phase, GENEVIC is assessed using a curated database that ranks genetic variants associated with Alzheimer's disease, schizophrenia, and cognition, based on their effect weights from the Polygenic Score Catalog, thus enabling researchers to prioritize genetic variants in complex diseases. GENEVIC's operation is user-friendly, accessible without any specialized training, secured by Azure OpenAI's HIPAA-compliant infrastructure, and evaluated for its efficacy through real-time query testing. As a prototype, GENEVIC is set to advance genetic research, enabling informed biomedical decisions. Availability and implementation: GENEVIC is publicly accessible at https://genevic-anath2024.streamlit.app. The underlying code is open-source and available via GitHub at https://github.com/anath2110/GENEVIC.git.

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2404.03196 [pdf, other]

Okay, Let's Do This! Modeling Event Coreference with Generated Rationales and Knowledge Distillation

Authors: Abhijnan Nath, Shadi Manafi, Avyakta Chelle, Nikhil Krishnaswamy

Abstract: In NLP, Event Coreference Resolution (ECR) is the task of connecting event clusters that refer to the same underlying real-life event, usually via neural systems. In this work, we investigate using abductive free-text rationales (FTRs) generated by modern autoregressive LLMs as distant supervision of smaller student models for cross-document coreference (CDCR) of events. We implement novel rationale-oriented event clustering and knowledge distillation methods for event coreference scoring that leverage enriched information from the FTRs for improved CDCR without additional annotation or expensive document clustering. Our model using coreference-specific knowledge distillation achieves SOTA B3 F1 on the ECB+ and GVC corpora, and we establish a new baseline on the AIDA Phase 1 corpus. Our code can be found at https://github.com/csu-signal/llama_cdcr

Submitted 4 April, 2024; originally announced April 2024.

Comments: To be published in NAACL 2024 Main

arXiv:2310.19990 [pdf, other]

Unveiling the Limits of Learned Local Search Heuristics: Are You the Mightiest of the Meek?

Authors: Ankur Nath, Alan Kuhnle

Abstract: In recent years, combining neural networks with local search heuristics has become popular in the field of combinatorial optimization. Despite its considerable computational demands, this approach has exhibited promising outcomes with minimal manual engineering. However, we have identified three critical limitations in the empirical evaluation of these integration attempts. Firstly, instances with moderate complexity and weak baselines pose a challenge in accurately evaluating the effectiveness of learning-based approaches. Secondly, the absence of an ablation study makes it difficult to quantify and attribute improvements accurately to the deep learning architecture. Lastly, the generalization of learned heuristics across diverse distributions remains underexplored. In this study, we conduct a comprehensive investigation into these identified limitations. Surprisingly, we demonstrate that a simple learned heuristic based on Tabu Search surpasses state-of-the-art (SOTA) learned heuristics in terms of performance and generalizability. Our findings challenge prevailing assumptions and open up exciting avenues for future research and innovation in combinatorial optimization.

Submitted 30 October, 2023; originally announced October 2023.

arXiv:2306.05434 [pdf, other]

How Good is the Model in Model-in-the-loop Event Coreference Resolution Annotation?

Authors: Shafiuddin Rehan Ahmed, Abhijnan Nath, Michael Regan, Adam Pollins, Nikhil Krishnaswamy, James H. Martin

Abstract: Annotating cross-document event coreference links is a time-consuming and cognitively demanding task that can compromise annotation quality and efficiency. To address this, we propose a model-in-the-loop annotation approach for event coreference resolution, where a machine learning model suggests likely coreferring event pairs only. We evaluate the effectiveness of this approach by first simulating the annotation process and then, using a novel annotator-centric Recall-Annotation effort trade-off metric, we compare the results of various underlying models and datasets. We finally present a method for obtaining 97% recall while substantially reducing the workload required by a fully manual annotation process. Code and data can be found at https://github.com/ahmeshaf/model_in_coref

Submitted 6 June, 2023; originally announced June 2023.

Comments: The 17th Linguistic Annotation Workshop, 2023 (LAW-XVII) short paper. 10 pages, 6 figures, 1 table

arXiv:2305.13641 [pdf, other]

AxomiyaBERTa: A Phonologically-aware Transformer Model for Assamese

Authors: Abhijnan Nath, Sheikh Mannan, Nikhil Krishnaswamy

Abstract: Despite their successes in NLP, Transformer-based language models still require extensive computing resources and suffer in low-resource or low-compute settings. In this paper, we present AxomiyaBERTa, a novel BERT model for Assamese, a morphologically-rich low-resource language (LRL) of Eastern India. AxomiyaBERTa is trained only on the masked language modeling (MLM) task, without the typical additional next sentence prediction (NSP) objective, and our results show that in resource-scarce settings for very low-resource languages like Assamese, MLM alone can be successfully leveraged for a range of tasks. AxomiyaBERTa achieves SOTA on token-level tasks like Named Entity Recognition and also performs well on "longer-context" tasks like Cloze-style QA and Wiki Title Prediction, with the assistance of a novel embedding disperser and phonological signals respectively. Moreover, we show that AxomiyaBERTa can leverage phonological signals for even more challenging tasks, such as a novel cross-document coreference task on a translated version of the ECB+ corpus, where we present a new SOTA result for an LRL. Our source code and evaluation scripts may be found at https://github.com/csu-signal/axomiyaberta.

Submitted 22 May, 2023; originally announced May 2023.

Comments: 16 pages, 6 figures, 8 tables, appearing in Findings of the ACL: ACL 2023. This version compiled using pdfLaTeX-compatible Assamese script font. Assamese text may appear differently here than in official ACL 2023 proceedings

arXiv:2305.05672 [pdf, other]

$2 * n$ is better than $n^2$: Decomposing Event Coreference Resolution into Two Tractable Problems

Authors: Shafiuddin Rehan Ahmed, Abhijnan Nath, James H. Martin, Nikhil Krishnaswamy

Abstract: Event Coreference Resolution (ECR) is the task of linking mentions of the same event either within or across documents. Most mention pairs are not coreferent, yet many that are coreferent can be identified through simple techniques such as lemma matching of the event triggers or the sentences in which they appear. Existing methods for training coreference systems sample from a largely skewed distribution, making it difficult for the algorithm to learn coreference beyond surface matching. Additionally, these methods are intractable because of the quadratic operations needed. To address these challenges, we break the problem of ECR into two parts: a) a heuristic to efficiently filter out a large number of non-coreferent pairs, and b) a training approach on a balanced set of coreferent and non-coreferent mention pairs. By following this approach, we show that we get comparable results to the state of the art on two popular ECR datasets while significantly reducing compute requirements. We also analyze the mention pairs that are "hard" to accurately classify as coreferent or non-coreferent. Code at https://github.com/ahmeshaf/lemma_ce_coref

Submitted 9 May, 2023; originally announced May 2023.

Comments: Findings of the Association of Computational Linguistics, ACL 2023. 13 pages, 7 figures, 6 tables

arXiv:2107.06426 [pdf, other]

TSCAN : Dialog Structure discovery using SCAN

Authors: Apurba Nath, Aayush Kubba

Abstract: Can we discover dialog structure by dividing utterances into labelled clusters? Can these labels be generated from the data? Typically, for dialogs we need an ontology and use it to discover structure; however, by using unsupervised classification and self-labelling we are able to intuit this structure without any labels or ontology. In this paper we apply SCAN (Semantic Clustering using Nearest Neighbors) to dialog data. We use BERT for the pretext task and an adaptation of SCAN for clustering and self-labelling. These clusters are used to identify transition probabilities and create the dialog structure. The self-labelling method used for SCAN makes these structures interpretable, as every cluster has a label. As the approach is unsupervised, choosing evaluation metrics is a challenge; we use statistical measures as proxies for structure quality.

Submitted 18 July, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

arXiv:2107.04681 [pdf]

A Survey on Personal Image Retrieval Systems

Authors: Amit Kumar Nath, Andy Wang

Abstract: The number of photographs taken worldwide is growing rapidly and steadily. While a small subset of these images is annotated and shared by users through social media platforms, due to the sheer number of images in personal photo repositories (shared or not shared), finding specific images remains challenging. This survey explores existing image retrieval techniques as well as photo-organizer applications to highlight their relative strengths in addressing this challenge.

Submitted 9 July, 2021; originally announced July 2021.

arXiv:2104.12141 [pdf, ps, other]

Coresets for $k$-median clustering under Fréchet and Hausdorff distances

Authors: Abhinandan Nath

Abstract: We give algorithms for computing coresets for $(1+\varepsilon)$-approximate $k$-median clustering of polygonal curves (under the discrete and continuous Fréchet distance) and point sets (under the Hausdorff distance), when the cluster centers are restricted to be of low complexity. Ours is the first such result, where the size of the coreset is independent of the number of input curves/point sets to be clustered (although it still depends on the maximum complexity of each input object). Specifically, the size of the coreset is $\Theta\left(\frac{k^3 l m^{\delta} d}{\varepsilon^2}\log\left(\frac{kl}{\varepsilon}\right)\right)$ for any $\delta > 0$, where $d$ is the ambient dimension, $m$ is the maximum number of points in an input curve/point set, and $l$ is the maximum number of points allowed in a cluster center. We formally characterize a general condition on the restricted space of cluster centers -- this helps us to generalize and apply the importance sampling framework, that was used by Langberg and Schulman for computing coresets for $k$-median clustering of $d$-dimensional points on normed spaces in $\mathbb{R}^d$, to the problem of clustering curves and point sets using the Fréchet and Hausdorff metrics. Roughly, the condition places an upper bound on the number of different combinations of metric balls that the restricted space of cluster centers can hit. We also derive lower bounds on the size of the coreset, given the restriction that the coreset must be a subset of the input objects.

Submitted 25 April, 2021; originally announced April 2021.

arXiv:2012.11532 [pdf, other]

Dual-CyCon Net: A Cycle Consistent Dual-Domain Convolutional Neural Network Framework for Detection of Partial Discharge

Authors: Mohammad Zunaed, Ankur Nath, Md. Saifur Rahman

Abstract: In the last decade, researchers have been investigating the severity of insulation breakdown caused by partial discharge (PD) in overhead transmission lines with covered conductors or electrical equipment such as generators and motors used in various industries. Developing an effective partial discharge detection system can lead to significant savings on maintenance and prevent power disruptions. Traditional methods rely on hand-crafted features and domain expertise to identify partial discharge patterns in the electrical current. Many data-driven deep learning-based methods have been proposed in recent years to remove this ad hoc feature extraction. However, most of these methods operate in either the time domain or the frequency domain. Many research approaches have been developed to generate phase-resolved partial discharge (PRPD) patterns from raw PD sensor data. These PRPD diagrams suggest a correlation between partial discharge activities occurring in an alternating electrical waveform's positive and negative half-cycles. However, this correlation between half-cycles has remained unexplored in deep learning-based methods. This work proposes a novel feature-fusion-based Dual-CyCon Net that can utilize time, frequency, and phase domain features together for joint learning in one cohesive framework. Our proposed cycle-consistency loss exploits any relation between an alternating electrical signal's positive and negative half-cycles to calibrate the model's sensitivity. This loss explores cycle-invariant PD-specific features, enabling the model to learn more robust, noise-invariant features for PD detection. A case study of our proposed framework on public, real-world noisy measurements from high-frequency voltage sensors used to detect damaged power lines achieved a state-of-the-art MCC score of 0.8455.

Submitted 19 October, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

arXiv:2004.00722 [pdf, ps, other]

k-Median clustering under discrete Fréchet and Hausdorff distances

Authors: Abhinandan Nath, Erin Taylor

Abstract: We give the first near-linear time $(1+\varepsilon)$-approximation algorithm for $k$-median clustering of polygonal trajectories under the discrete Fréchet distance, and the first polynomial time $(1+\varepsilon)$-approximation algorithm for $k$-median clustering of finite point sets under the Hausdorff distance, provided the cluster centers, ambient dimension, and $k$ are bounded by a constant. The main technique is a general framework for solving clustering problems where the cluster centers are restricted to come from a simpler metric space. We precisely characterize conditions on the simpler metric space of the cluster centers that allow faster $(1+\varepsilon)$-approximations for the $k$-median problem. We also show that the $k$-median problem under Hausdorff distance is NP-hard.

Submitted 1 April, 2020; originally announced April 2020.

Comments: A shorter version to appear in SoCG 2020

arXiv:1901.07715 [pdf, ps, other]

Enhancing MapReduce Fault Recovery Through Binocular Speculation

Authors: Huansong Fu, Yue Zhu, Amit Kumar Nath, Md. Muhib Khan, Weikuan Yu

Abstract: MapReduce speculation plays an important role in finding potential task stragglers and failures. But a tacit dichotomy exists in MapReduce due to its inherent two-phase (map and reduce) management scheme in which map tasks and reduce tasks have distinctly different execution behaviors, yet reduce tasks are dependent on the results of map tasks. We reveal that speculation policies for fault handling in MapReduce do not recognize this dichotomy between map and reduce tasks, which leads to an issue of speculation myopia for MapReduce fault recovery. These issues cause significant performance degradation upon network and node failures. To address the speculation myopia caused by MapReduce dichotomy, we introduce a new scheme called binocular speculation to help MapReduce increase its assessment scope for speculation. As part of the scheme, we also design three component techniques including neighborhood glance, collective speculation and speculative rollback. Our evaluation shows that, with these techniques, binocular speculation can increase the coordination of map and reduce phases, and enhance the efficiency of MapReduce fault recovery.

Submitted 22 January, 2019; originally announced January 2019.

Comments: 10 pages, 9 figures

arXiv:1808.05827 [pdf]

Confidential Encrypted Data Hiding and Retrieval Using QR Authentication System

Authors: Somdip Dey, Asoke Nath, Shalabh Agarwal

Abstract: The security and authenticity of data is now a big challenge. To solve this problem, we propose an innovative method to authenticate digital documents. In this paper, we propose a new method in which the marks obtained by a candidate are also encoded in a QR Code™ in encrypted form, so that an intruder who tries to change the marks in the mark sheet cannot do the same in the QR Code™, because the encryption key is unknown to them. In this method, we encrypt the mark sheet data using the TTJSA encryption algorithm. The encrypted marks are entered inside the QR code, and that QR code is printed along with the original data of the mark sheet. The marks can then be retrieved from the QR code, decrypted using the TTJSA decryption algorithm, and verified against the marks already on the mark sheet.

Submitted 17 August, 2018; originally announced August 2018.

Journal ref: 2013 International Conference on Communication Systems and Network Technologies

arXiv:1509.05751 [pdf, ps, other]

Computing the Gromov-Hausdorff Distance for Metric Trees

Authors: Pankaj K. Agarwal, Kyle Fox, Abhinandan Nath, Anastasios Sidiropoulos, Yusu Wang

Abstract: The Gromov-Hausdorff (GH) distance is a natural way to measure distance between two metric spaces. We prove that it is $\mathrm{NP}$-hard to approximate the Gromov-Hausdorff distance better than a factor of $3$ for geodesic metrics on a pair of trees. We complement this result by providing a polynomial time $O(\min\{n, \sqrt{rn}\})$-approximation algorithm for computing the GH distance between a pair of metric trees, where $r$ is the ratio of the longest edge length in both trees to the shortest edge length. For metric trees with unit length edges, this yields an $O(\sqrt{n})$-approximation algorithm.

Submitted 13 June, 2017; v1 submitted 18 September, 2015; originally announced September 2015.

Comments: Appeared in Proceedings of the 26th International Symposium on Algorithms and Computation

arXiv:1507.01698 [pdf, other]

Learning Tractable Probabilistic Models for Fault Localization

Authors: Aniruddh Nath, Pedro Domingos

Abstract: In recent years, several probabilistic techniques have been applied to various debugging problems. However, most existing probabilistic debugging systems use relatively simple statistical models, and fail to generalize across multiple programs. In this work, we propose Tractable Fault Localization Models (TFLMs) that can be learned from data, and probabilistically infer the location of the bug. While most previous statistical debugging methods generalize over many executions of a single program, TFLMs are trained on a corpus of previously seen buggy programs, and learn to identify recurring patterns of bugs. Widely-used fault localization techniques such as TARANTULA evaluate the suspiciousness of each line in isolation; in contrast, a TFLM defines a joint probability distribution over buggy indicator variables for each line. Joint distributions with rich dependency structure are often computationally intractable; TFLMs avoid this by exploiting recent developments in tractable probabilistic models (specifically, Relational SPNs). Further, TFLMs can incorporate additional sources of information, including coverage-based features such as TARANTULA. We evaluate the fault localization performance of TFLMs that include TARANTULA scores as features in the probabilistic model. Our study shows that the learned TFLMs isolate bugs more effectively than previous statistical methods or using TARANTULA directly.

Submitted 7 July, 2015; originally announced July 2015.

Comments: Fifth International Workshop on Statistical Relational AI (StaR-AI 2015)

Showing 1–19 of 19 results for author: Nath, A