Search | arXiv e-print repository

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Authors: Jian Chen, Vashisth Tiwari, Ranajoy Sadhukhan, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, Beidi Chen

Abstract: Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency without sacrificing performance but the conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that surprisingly SD can achieve speedup even for a high throughput inference regime for moderate to long sequences. More interestingly, an intelligent drafting strategy can achieve better speedup with increasing batch size based on our rigorous analysis. MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy speculative decoding more effectively for high throughput inference. Then, it leverages draft models with sparse KV cache to address the KV bottleneck that scales with both sequence length and batch size. This finding underscores the broad applicability of speculative decoding in long-context serving, as it can enhance throughput and reduce latency without compromising accuracy. For moderate to long sequences, we demonstrate up to 2x speedup for LLaMA-2-7B-32K and 1.84x speedup for LLaMA-3.1-8B when serving batch sizes ranging from 32 to 256 on 8 NVIDIA A100 GPUs. The code is available at https://github.com/Infini-AI-Lab/MagicDec/.

Submitted 23 August, 2024; v1 submitted 20 August, 2024; originally announced August 2024.

arXiv:2406.00010 [pdf, other]

EnterpriseEM: Fine-tuned Embeddings for Enterprise Semantic Search

Authors: Kamalkumar Rathinasamy, Jayarama Nettar, Amit Kumar, Vishal Manchanda, Arun Vijayakumar, Ayush Kataria, Venkateshprasanna Manjunath, Chidambaram GS, Jaskirat Singh Sodhi, Shoeb Shaikh, Wasim Akhtar Khan, Prashant Singh, Tanishq Dattatray Ige, Vipin Tiwari, Rajab Ali Mondal, Harshini K, S Reka, Chetana Amancharla, Faiz ur Rahman, Harikrishnan P A, Indraneel Saha, Bhavya Tiwary, Navin Shankar Patel, Pradeep T S, Balaji A J, et al. (2 additional authors not shown)

Abstract: Enterprises grapple with the significant challenge of managing proprietary unstructured data, hindering efficient information retrieval. This has led to the emergence of AI-driven information retrieval solutions, designed to adeptly extract relevant insights to address employee inquiries. These solutions often leverage pre-trained embedding models and generative models as foundational components. While pre-trained embeddings may exhibit proximity or disparity based on their original training objectives, they might not fully align with the unique characteristics of enterprise-specific data, leading to suboptimal alignment with the retrieval goals of enterprise environments. In this paper, we propose a methodology to fine-tune pre-trained embedding models specifically for enterprise environments. By adapting the embeddings to better suit the retrieval tasks prevalent in enterprises, we aim to enhance the performance of information retrieval solutions. We discuss the process of fine-tuning, its effect on retrieval accuracy, and the potential benefits for enterprise information management. Our findings demonstrate the efficacy of fine-tuned embedding models in improving the precision and relevance of search results in enterprise settings.

Submitted 18 May, 2024; originally announced June 2024.

ACM Class: I.2.7

arXiv:2401.00981 [pdf]

Machine Learning Classification of Alzheimer's Disease Stages Using Cerebrospinal Fluid Biomarkers Alone

Authors: Vivek Kumar Tiwari, Premananda Indic, Shawana Tabassum

Abstract: Early diagnosis of Alzheimer's disease is a challenge because the existing methodologies do not identify the patients in their preclinical stage, which can last up to a decade prior to the onset of clinical symptoms. Several research studies demonstrate the potential of cerebrospinal fluid biomarkers, amyloid beta 1-42, T-tau, and P-tau, in early diagnosis of Alzheimer's disease stages. In this work, we used machine learning models to classify different stages of Alzheimer's disease based on the cerebrospinal fluid biomarker levels alone. An electronic health record of patients from the National Alzheimer's Coordinating Centre database was analyzed and the patients were subdivided based on mini-mental state scores and clinical dementia ratings. Statistical and correlation analyses were performed to identify significant differences between the Alzheimer's stages. Afterward, machine learning classifiers including K-Nearest Neighbors, Ensemble Boosted Tree, Ensemble Bagged Tree, Support Vector Machine, Logistic Regression, and Naive Bayes classifiers were employed to classify the Alzheimer's disease stages. The results demonstrate that Ensemble Boosted Tree (84.4%) and Logistic Regression (73.4%) provide the highest accuracy for binary classification, while Ensemble Bagged Tree (75.4%) demonstrates better accuracy for multiclassification. The findings from this research are expected to help clinicians in making an informed decision regarding the early diagnosis of Alzheimer's from the cerebrospinal fluid biomarkers alone, monitoring of the disease progression, and implementation of appropriate intervention measures.

Submitted 1 January, 2024; originally announced January 2024.

arXiv:2311.00991 [pdf, other]

IR-UWB Radar-based Situational Awareness System for Smartphone-Distracted Pedestrians

Authors: Jamsheed Manja Ppallan, Ruchi Pandey, Yellappa Damam, Vijay Narayan Tiwari, Karthikeyan Arunachalam, Antariksha Ray

Abstract: With the widespread adoption of smartphones, ensuring pedestrian safety on roads has become a critical concern due to smartphone distraction. This paper proposes a novel and real-time assistance system called UWB-assisted Safe Walk (UASW) for obstacle detection and warns users about real-time situations. The proposed method leverages Impulse Radio Ultra-Wideband (IR-UWB) radar embedded in the smartphone, which provides excellent range resolution and high noise resilience using short pulses. We implemented UASW specifically for Android smartphones with IR-UWB connectivity. The framework uses complex Channel Impulse Response (CIR) data to integrate rule-based obstacle detection with artificial neural network (ANN) based obstacle classification. The performance of the proposed UASW system is analyzed using real-time collected data. The results show that the proposed system achieves an obstacle detection accuracy of up to 97% and obstacle classification accuracy of up to 95% with an inference delay of 26.8 ms. The results highlight the effectiveness of UASW in assisting smartphone-distracted pedestrians and improving their situational awareness.

Submitted 2 November, 2023; originally announced November 2023.

arXiv:1910.14552 [pdf, other]

On the Interaction Between Deep Detectors and Siamese Trackers in Video Surveillance

Authors: Madhu Kiran, Vivek Tiwari, Le Thanh Nguyen-Meidine, Eric Granger

Abstract: Visual object tracking is an important function in many real-time video surveillance applications, such as localization and spatio-temporal recognition of persons. In real-world applications, an object detector and tracker must interact on a periodic basis to discover new objects, and thereby to initiate tracks. Periodic interactions with the detector can also allow the tracker to validate and/or update its object template with new bounding boxes. However, bounding boxes provided by a state-of-the-art detector are noisy, due to changes in appearance, background and occlusion, which can cause the tracker to drift. Moreover, CNN-based detectors can provide a high level of accuracy at the expense of computational complexity, so interactions should be minimized for real-time applications. In this paper, a new approach is proposed to manage detector-tracker interactions for trackers from the Siamese-FC family. By integrating a change detection mechanism into a deep Siamese-FC tracker, its template can be adapted in response to changes in a target's appearance that lead to drifts during tracking. An abrupt change detection triggers an update of tracker template using the bounding box produced by the detector, while in the case of a gradual change, the detector is used to update an evolving set of templates for robust matching. Experiments were performed using state-of-the-art Siamese-FC trackers and the YOLOv3 detector on a subset of videos from the OTB-100 dataset that mimic video surveillance scenarios. Results highlight the importance for reliable VOT of using accurate detectors. They also indicate that our adaptive Siamese trackers are robust to noisy object detections, and can significantly improve the performance of Siamese-FC tracking.

Submitted 31 October, 2019; originally announced October 2019.

Comments: Presented in AVSS-2019 Conference

arXiv:1905.13368 [pdf, other]

Fast Online "Next Best Offers" using Deep Learning

Authors: Rekha Singhal, Gautam Shroff, Mukund Kumar, Sharod Roy, Sanket Kadarkar, Rupinder virk, Siddharth Verma, Vartika Tiwari

Abstract: In this paper, we present iPrescribe, a scalable low-latency architecture for recommending 'next-best-offers' in an online setting. The paper presents the design of iPrescribe and compares its performance for implementations using different real-time streaming technology stacks. iPrescribe uses an ensemble of deep learning and machine learning algorithms for prediction. We describe the scalable real-time streaming technology stack and optimized machine-learning implementations to achieve a 90th percentile recommendation latency of 38 milliseconds. Optimizations include a novel mechanism to deploy recurrent Long Short Term Memory (LSTM) deep learning networks efficiently.

Submitted 30 May, 2019; originally announced May 2019.

Comments: 7 Pages, Accepted in COMAD-CODS 2019

arXiv:1006.4538 [pdf]

Computational Analysis of .NET Remoting and Mobile agent in Distributed Environment

Authors: Vivek Tiwari, G. Shailendra, Renu Tiwari, Malam Kirar

Abstract: A mobile agent is a program that is not bound to the system on which it began execution, but rather travels amongst the hosts in the network with its code and current execution state (i.e. Distributed Environment). The implementation of distributed applications can be based on a multiplicity of technologies, e.g. plain sockets, Remote Procedure Call (RPC), Remote Method Invocation (RMI), Java Message Service (JMS), .NET Remoting, or Web Services. These technologies differ widely in complexity, interoperability, standardization, and ease of use. The Mobile Agent technology is emerging as an alternative to build a smart generation of highly distributed systems. In this work, we investigate the performance aspect of agent-based technologies for information retrieval. We present a comparative performance evaluation model of Mobile Agents versus .Net remoting by means of an analytical approach. Quantitative measurements are performed to compare .Net remoting and mobile agents using communication time, code size (agent code), data size, and number of nodes as performance parameters in this research work. The results depict that the Mobile Agent paradigm offers superior performance compared to the .Net remoting paradigm: it offers fast computational speed and procures lower invocation cost by making local invocations instead of remote invocations over the network, thereby reducing network bandwidth.

Submitted 23 June, 2010; originally announced June 2010.

Comments: IEEE Publication Format, https://sites.google.com/site/journalofcomputing/

Journal ref: Journal of Computing, Vol. 2, No. 6, June 2010, NY, USA, ISSN 2151-9617

arXiv:1005.4030 [pdf]

Scope of cloud computing for SMEs in India

Authors: Monika Sharma, Ashwani Mehra, Haresh Jola, Anand Kumar, Madhvendra Misra, Vijayshri Tiwari

Abstract: Cloud computing is a set of services that provide infrastructure resources using internet media and data storage on a third party server. SMEs are said to be the lifeblood of any vibrant economy. They are known to be the silent drivers of a nation's economy. SMEs of India are one of the most aggressive adopters of ERP Packages. Most of the Indian SMEs have adopted the traditional ERP Systems and have incurred a heavy cost while implementing these systems. This paper presents the cost savings and reduction in the level of difficulty in adopting a cloud computing Service (CCS) enabled ERP system. For the study, IT people from 30 North Indian SMEs were interviewed. In the cloud computing environment the SMEs will not have to own the infrastructure so they can abstain from any capital expenditure and instead they can utilize the resources as a service and pay as per their usage. We consider the results of the paper to be supportive to our proposed research concept.

Submitted 21 May, 2010; originally announced May 2010.

Comments: http://www.journalofcomputing.org

Journal ref: Journal of Computing, Volume 2, Issue 5, May 2010

arXiv:1005.1904 [pdf]

Cloud Computing: Exploring the scope

Authors: Abhinav Pandey, Akash Pandey, Ankit Tandon, Brajesh Kr Maurya, Upendra Kushwaha, Dr. Madhvendra Mishra, Vijayshree Tiwari

Abstract: Cloud computing refers to a paradigm shift to overall IT solutions while raising the accessibility, scalability and effectiveness through its enabling technologies. However, migrated cloud platforms and services cost benefits as well as performances are neither clear nor summarized. Globalization and the recessionary economic times have not only raised the bar for better IT delivery models but also have given access to technology enabled services via internet. Cloud computing has vast potential in terms of lean Retail methodologies that can minimize the operational cost by using the third party based IT capabilities, as a service. It will not only increase the ROI but will also help in lowering the total cost of ownership. In this paper we have tried to compare the cloud computing cost benefits with the actual premise cost which an organization incurs normally. However, in spite of the cost benefits, many IT professionals believe that the latest model i.e. "cloud computing" has risks and security concerns. This report demonstrates how to answer the following questions: (1) Idea behind cloud computing. (2) Monetary cost benefits of using cloud with respect to traditional premise computing. (3) What are the various security issues? We have tried to find out the cost benefit by comparing the Microsoft Azure cloud cost with the prevalent premise cost.

Submitted 20 May, 2010; v1 submitted 11 May, 2010; originally announced May 2010.

Comments: 9 pages, 7 figures, Paper accepted for the 2010 International Conference on Informatics, Cybernetics, and Computer Applications (ICICCA 2010)

Showing 1–9 of 9 results for author: Tiwari, V