Search | arXiv e-print repository

Can AI Assistance Aid in the Grading of Handwritten Answer Sheets?

Authors: Pritam Sil, Parag Chaudhuri, Bhaskaran Raman

Abstract: With recent advancements in artificial intelligence (AI), there has been growing interest in using state of the art (SOTA) AI solutions to provide assistance in grading handwritten answer sheets. While a few commercial products exist, the question of whether AI-assistance can actually reduce grading effort and time has not yet been carefully considered in published literature. This work introduces… ▽ More With recent advancements in artificial intelligence (AI), there has been growing interest in using state of the art (SOTA) AI solutions to provide assistance in grading handwritten answer sheets. While a few commercial products exist, the question of whether AI-assistance can actually reduce grading effort and time has not yet been carefully considered in published literature. This work introduces an AI-assisted grading pipeline. The pipeline first uses text detection to automatically detect question regions present in a question paper PDF. Next, it uses SOTA text detection methods to highlight important keywords present in the handwritten answer regions of scanned answer sheets to assist in the grading process. We then evaluate a prototype implementation of the AI-assisted grading pipeline deployed on an existing e-learning management platform. The evaluation involves a total of 5 different real-life examinations across 4 different courses at a reputed institute; it consists of a total of 42 questions, 17 graders, and 468 submissions. We log and analyze the grading time for each handwritten answer while using AI assistance and without it. Our evaluations have shown that, on average, the graders take 31% less time while grading a single response and 33% less grading time while grading a single answer sheet using AI assistance. △ Less

Submitted 23 August, 2024; originally announced August 2024.

arXiv:2408.04940 [pdf, other]

Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy

Authors: Palak Handa, Amirreza Mahbod, Florian Schwarzhans, Ramona Woitek, Nidhi Goel, Deepti Chhabra, Shreshtha Jha, Manas Dhir, Deepak Gunjan, Jagadeesh Kakarla, Balasubramanian Raman

Abstract: We present the Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy. It is being virtually organized by the Research Center for Medical Image Analysis and Artificial Intelligence (MIAAI), Department of Medicine, Danube Private University, Krems, Austria and Medical Imaging and Signal Analysis Hub (MISAHUB) in collaboration with the 9th International Con… ▽ More We present the Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy. It is being virtually organized by the Research Center for Medical Image Analysis and Artificial Intelligence (MIAAI), Department of Medicine, Danube Private University, Krems, Austria and Medical Imaging and Signal Analysis Hub (MISAHUB) in collaboration with the 9th International Conference on Computer Vision & Image Processing (CVIP 2024) being organized by the Indian Institute of Information Technology, Design and Manufacturing (IIITDM) Kancheepuram, Chennai, India. This document describes the overview of the challenge, its registration and rules, submission format, and the description of the utilized datasets. △ Less

Submitted 9 August, 2024; originally announced August 2024.

Comments: 6 pages

arXiv:2407.12818 [pdf, other]

"I understand why I got this grade": Automatic Short Answer Grading with Feedback

Authors: Dishank Aggarwal, Pushpak Bhattacharyya, Bhaskaran Raman

Abstract: The demand for efficient and accurate assessment methods has intensified as education systems transition to digital platforms. Providing feedback is essential in educational settings and goes beyond simply conveying marks as it justifies the assigned marks. In this context, we present a significant advancement in automated grading by introducing Engineering Short Answer Feedback (EngSAF) -- a data… ▽ More The demand for efficient and accurate assessment methods has intensified as education systems transition to digital platforms. Providing feedback is essential in educational settings and goes beyond simply conveying marks as it justifies the assigned marks. In this context, we present a significant advancement in automated grading by introducing Engineering Short Answer Feedback (EngSAF) -- a dataset of 5.8k student answers accompanied by reference answers and questions for the Automatic Short Answer Grading (ASAG) task. The EngSAF dataset is meticulously curated to cover a diverse range of subjects, questions, and answer patterns from multiple engineering domains. We leverage state-of-the-art large language models' (LLMs) generative capabilities with our Label-Aware Synthetic Feedback Generation (LASFG) strategy to include feedback in our dataset. This paper underscores the importance of enhanced feedback in practical educational settings, outlines dataset annotation and feedback generation processes, conducts a thorough EngSAF analysis, and provides different LLMs-based zero-shot and finetuned baselines for future comparison. Additionally, we demonstrate the efficiency and effectiveness of the ASAG system through its deployment in a real-world end-semester exam at the Indian Institute of Technology Bombay (IITB), showcasing its practical viability and potential for broader implementation in educational institutions. △ Less

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2405.10256 [pdf, other]

Biasing & Debiasing based Approach Towards Fair Knowledge Transfer for Equitable Skin Analysis

Authors: Anshul Pundhir, Balasubramanian Raman, Pravendra Singh

Abstract: Deep learning models, particularly Convolutional Neural Networks (CNNs), have demonstrated exceptional performance in diagnosing skin diseases, often outperforming dermatologists. However, they have also unveiled biases linked to specific demographic traits, notably concerning diverse skin tones or gender, prompting concerns regarding fairness and limiting their widespread deployment. Researchers… ▽ More Deep learning models, particularly Convolutional Neural Networks (CNNs), have demonstrated exceptional performance in diagnosing skin diseases, often outperforming dermatologists. However, they have also unveiled biases linked to specific demographic traits, notably concerning diverse skin tones or gender, prompting concerns regarding fairness and limiting their widespread deployment. Researchers are actively working to ensure fairness in AI-based solutions, but existing methods incur an accuracy loss when striving for fairness. To solve this issue, we propose a `two-biased teachers' (i.e., biased on different sensitive attributes) based approach to transfer fair knowledge into the student network. Our approach mitigates biases present in the student network without harming its predictive accuracy. In fact, in most cases, our approach improves the accuracy of the baseline model. To achieve this goal, we developed a weighted loss function comprising biasing and debiasing loss terms. We surpassed available state-of-the-art approaches to attain fairness and also improved the accuracy at the same time. The proposed approach has been evaluated and validated on two dermatology datasets using standard accuracy and fairness evaluation measures. We will make source code publicly available to foster reproducibility and future research. △ Less

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2403.02833 [pdf, other]

SOFIM: Stochastic Optimization Using Regularized Fisher Information Matrix

Authors: Mrinmay Sen, A. K. Qin, Gayathri C, Raghu Kishore N, Yen-Wei Chen, Balasubramanian Raman

Abstract: This paper introduces a new stochastic optimization method based on the regularized Fisher information matrix (FIM), named SOFIM, which can efficiently utilize the FIM to approximate the Hessian matrix for finding Newton's gradient update in large-scale stochastic optimization of machine learning models. It can be viewed as a variant of natural gradient descent, where the challenge of storing and… ▽ More This paper introduces a new stochastic optimization method based on the regularized Fisher information matrix (FIM), named SOFIM, which can efficiently utilize the FIM to approximate the Hessian matrix for finding Newton's gradient update in large-scale stochastic optimization of machine learning models. It can be viewed as a variant of natural gradient descent, where the challenge of storing and calculating the full FIM is addressed through making use of the regularized FIM and directly finding the gradient update direction via Sherman-Morrison matrix inversion. Additionally, like the popular Adam method, SOFIM uses the first moment of the gradient to address the issue of non-stationary objectives across mini-batches due to heterogeneous data. The utilization of the regularized FIM and Sherman-Morrison matrix inversion leads to the improved convergence rate with the same space and time complexities as stochastic gradient descent (SGD) with momentum. The extensive experiments on training deep learning models using several benchmark image classification datasets demonstrate that the proposed SOFIM outperforms SGD with momentum and several state-of-the-art Newton optimization methods in term of the convergence speed for achieving the pre-specified objectives of training and test losses as well as test accuracy. △ Less

Submitted 1 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

arXiv:2402.07640 [pdf, other]

CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis

Authors: Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li

Abstract: The Controllable Multimodal Feedback Synthesis (CMFeed) dataset enables the generation of sentiment-controlled feedback from multimodal inputs. It contains images, text, human comments, comments' metadata and sentiment labels. Existing datasets for related tasks such as multimodal summarization, visual question answering, visual dialogue, and sentiment-aware text generation do not incorporate trai… ▽ More The Controllable Multimodal Feedback Synthesis (CMFeed) dataset enables the generation of sentiment-controlled feedback from multimodal inputs. It contains images, text, human comments, comments' metadata and sentiment labels. Existing datasets for related tasks such as multimodal summarization, visual question answering, visual dialogue, and sentiment-aware text generation do not incorporate training models using human-generated outputs and their metadata, a gap that CMFeed addresses. This capability is critical for developing feedback systems that understand and replicate human-like spontaneous responses. Based on the CMFeed dataset, we define a novel task of controllable feedback synthesis to generate context-aware feedback aligned with the desired sentiment. We propose a benchmark feedback synthesis system comprising encoder, decoder, and controllability modules. It employs transformer and Faster R-CNN networks to extract features and generate sentiment-specific feedback, achieving a sentiment classification accuracy of 77.23%, which is 18.82% higher than models not leveraging the dataset's unique controllability features. Additionally, we incorporate a similarity module for relevance assessment through rank-based metrics. △ Less

Submitted 5 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2307.00324 [pdf, other]

DeepMediX: A Deep Learning-Driven Resource-Efficient Medical Diagnosis Across the Spectrum

Authors: Kishore Babu Nampalle, Pradeep Singh, Uppala Vivek Narayan, Balasubramanian Raman

Abstract: In the rapidly evolving landscape of medical imaging diagnostics, achieving high accuracy while preserving computational efficiency remains a formidable challenge. This work presents \texttt{DeepMediX}, a groundbreaking, resource-efficient model that significantly addresses this challenge. Built on top of the MobileNetV2 architecture, DeepMediX excels in classifying brain MRI scans and skin cancer… ▽ More In the rapidly evolving landscape of medical imaging diagnostics, achieving high accuracy while preserving computational efficiency remains a formidable challenge. This work presents \texttt{DeepMediX}, a groundbreaking, resource-efficient model that significantly addresses this challenge. Built on top of the MobileNetV2 architecture, DeepMediX excels in classifying brain MRI scans and skin cancer images, with superior performance demonstrated on both binary and multiclass skin cancer datasets. It provides a solution to labor-intensive manual processes, the need for large datasets, and complexities related to image properties. DeepMediX's design also includes the concept of Federated Learning, enabling a collaborative learning approach without compromising data privacy. This approach allows diverse healthcare institutions to benefit from shared learning experiences without the necessity of direct data access, enhancing the model's predictive power while preserving the privacy and integrity of sensitive patient data. Its low computational footprint makes DeepMediX suitable for deployment on handheld devices, offering potential for real-time diagnostic support. Through rigorous testing on standard datasets, including the ISIC2018 for dermatological research, DeepMediX demonstrates exceptional diagnostic capabilities, matching the performance of existing models on almost all tasks and even outperforming them in some cases. The findings of this study underline significant implications for the development and deployment of AI-based tools in medical imaging and their integration into point-of-care settings. The source code and models generated would be released at https://github.com/kishorebabun/DeepMediX. △ Less

Submitted 1 July, 2023; originally announced July 2023.

Comments: 23 pages, 3 figures, 4 tables, 1 algorithm

ACM Class: I.2.1

arXiv:2306.17794 [pdf, other]

Vision Through the Veil: Differential Privacy in Federated Learning for Medical Image Classification

Authors: Kishore Babu Nampalle, Pradeep Singh, Uppala Vivek Narayan, Balasubramanian Raman

Abstract: The proliferation of deep learning applications in healthcare calls for data aggregation across various institutions, a practice often associated with significant privacy concerns. This concern intensifies in medical image analysis, where privacy-preserving mechanisms are paramount due to the data being sensitive in nature. Federated learning, which enables cooperative model training without direc… ▽ More The proliferation of deep learning applications in healthcare calls for data aggregation across various institutions, a practice often associated with significant privacy concerns. This concern intensifies in medical image analysis, where privacy-preserving mechanisms are paramount due to the data being sensitive in nature. Federated learning, which enables cooperative model training without direct data exchange, presents a promising solution. Nevertheless, the inherent vulnerabilities of federated learning necessitate further privacy safeguards. This study addresses this need by integrating differential privacy, a leading privacy-preserving technique, into a federated learning framework for medical image classification. We introduce a novel differentially private federated learning model and meticulously examine its impacts on privacy preservation and model performance. Our research confirms the existence of a trade-off between model accuracy and privacy settings. However, we demonstrate that strategic calibration of the privacy budget in differential privacy can uphold robust image classification performance while providing substantial privacy protection. △ Less

Submitted 30 June, 2023; originally announced June 2023.

Comments: 18 pages, 3 figures, 1 table, 1 algorithm

MSC Class: 68U10 ACM Class: I.2.1

arXiv:2306.15574 [pdf, other]

See Through the Fog: Curriculum Learning with Progressive Occlusion in Medical Imaging

Authors: Pradeep Singh, Kishore Babu Nampalle, Uppala Vivek Narayan, Balasubramanian Raman

Abstract: In recent years, deep learning models have revolutionized medical image interpretation, offering substantial improvements in diagnostic accuracy. However, these models often struggle with challenging images where critical features are partially or fully occluded, which is a common scenario in clinical practice. In this paper, we propose a novel curriculum learning-based approach to train deep lear… ▽ More In recent years, deep learning models have revolutionized medical image interpretation, offering substantial improvements in diagnostic accuracy. However, these models often struggle with challenging images where critical features are partially or fully occluded, which is a common scenario in clinical practice. In this paper, we propose a novel curriculum learning-based approach to train deep learning models to handle occluded medical images effectively. Our method progressively introduces occlusion, starting from clear, unobstructed images and gradually moving to images with increasing occlusion levels. This ordered learning process, akin to human learning, allows the model to first grasp simple, discernable patterns and subsequently build upon this knowledge to understand more complicated, occluded scenarios. Furthermore, we present three novel occlusion synthesis methods, namely Wasserstein Curriculum Learning (WCL), Information Adaptive Learning (IAL), and Geodesic Curriculum Learning (GCL). Our extensive experiments on diverse medical image datasets demonstrate substantial improvements in model robustness and diagnostic accuracy over conventional training methodologies. △ Less

Submitted 30 June, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

Comments: 25 pages, 3 figures, 1 table (supplementary section added)

MSC Class: 68T05; 68T10; 92C55 ACM Class: I.5.1

arXiv:2305.15426 [pdf, other]

Transcending Grids: Point Clouds and Surface Representations Powering Neurological Processing

Authors: Kishore Babu Nampalle, Pradeep Singh, Vivek Narayan Uppala, Sumit Gangwar, Rajesh Singh Negi, Balasubramanian Raman

Abstract: In healthcare, accurately classifying medical images is vital, but conventional methods often hinge on medical data with a consistent grid structure, which may restrict their overall performance. Recent medical research has been focused on tweaking the architectures to attain better performance without giving due consideration to the representation of data. In this paper, we present a novel approa… ▽ More In healthcare, accurately classifying medical images is vital, but conventional methods often hinge on medical data with a consistent grid structure, which may restrict their overall performance. Recent medical research has been focused on tweaking the architectures to attain better performance without giving due consideration to the representation of data. In this paper, we present a novel approach for transforming grid based data into its higher dimensional representations, leveraging unstructured point cloud data structures. We first generate a sparse point cloud from an image by integrating pixel color information as spatial coordinates. Next, we construct a hypersurface composed of points based on the image dimensions, with each smooth section within this hypersurface symbolizing a specific pixel location. Polygonal face construction is achieved using an adjacency tensor. Finally, a dense point cloud is generated by densely sampling the constructed hypersurface, with a focus on regions of higher detail. The effectiveness of our approach is demonstrated on a publicly accessible brain tumor dataset, achieving significant improvements over existing classification techniques. This methodology allows the extraction of intricate details from the original image, opening up new possibilities for advanced image analysis and processing tasks. △ Less

Submitted 2 June, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

arXiv:2208.11868 [pdf, other]

Interpretable Multimodal Emotion Recognition using Hybrid Fusion of Speech and Image Data

Authors: Puneet Kumar, Sarthak Malik, Balasubramanian Raman

Abstract: This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes. The proposed system's architecture has been determi… ▽ More This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes. The proposed system's architecture has been determined through intensive ablation studies. It fuses the speech & image features and then combines speech, image, and intermediate fusion outputs. The proposed interpretability technique incorporates the divide & conquer approach to compute shapely values denoting each speech & image feature's importance. We have also constructed a large-scale dataset (IIT-R SIER dataset), consisting of speech utterances, corresponding images, and class labels, i.e., 'anger,' 'happy,' 'hate,' and 'sad.' The proposed system has achieved 83.29% accuracy for emotion recognition. The enhanced performance of the proposed system advocates the importance of utilizing complementary information from multiple modalities for emotion recognition. △ Less

Submitted 7 January, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

Comments: arXiv admin note: text overlap with arXiv:2208.11450

arXiv:2208.11450 [pdf, other]

VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition

Authors: Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li

Abstract: This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTANet), to classify emotions reflected by input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has been developed that identifies important visual, spoken, and textual features leading to predicting a particular emot… ▽ More This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTANet), to classify emotions reflected by input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has been developed that identifies important visual, spoken, and textual features leading to predicting a particular emotion class. The VISTANet fuses information from image, speech, and text modalities using a hybrid of early and late fusion. It automatically adjusts the weights of their intermediate outputs while computing the weighted average. The KAAP technique computes the contribution of each modality and corresponding features toward predicting a particular emotion class. To mitigate the insufficiency of multimodal emotion datasets labeled with discrete emotion classes, we have constructed a large-scale IIT-R MMEmoRec dataset consisting of images, corresponding speech and text, and emotion labels ('angry,' 'happy,' 'hate,' and 'sad'). The VISTANet has resulted in 95.99% emotion recognition accuracy on the IIT-R MMEmoRec dataset using visual, audio, and textual modalities, outperforming when using any one or two modalities. The IIT-R MMEmoRec dataset can be accessed at https://github.com/MIntelligence-Group/MMEmoRec. △ Less

Submitted 26 May, 2024; v1 submitted 24 August, 2022; originally announced August 2022.

arXiv:2203.12692 [pdf, other]

Affective Feedback Synthesis Towards Multimodal Text and Image Data

Authors: Puneet Kumar, Gaurav Bhat, Omkar Ingle, Daksh Goyal, Balasubramanian Raman

Abstract: In this paper, we have defined a novel task of affective feedback synthesis that deals with generating feedback for input text & corresponding image in a similar way as humans respond towards the multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input. We have also constructed a large-scale dataset consisting of image… ▽ More In this paper, we have defined a novel task of affective feedback synthesis that deals with generating feedback for input text & corresponding image in a similar way as humans respond towards the multimodal data. A feedback synthesis system has been proposed and trained using ground-truth human comments along with image-text input. We have also constructed a large-scale dataset consisting of image, text, Twitter user comments, and the number of likes for the comments by crawling the news articles through Twitter feeds. The proposed system extracts textual features using a transformer-based textual encoder while the visual features have been extracted using a Faster region-based convolutional neural networks model. The textual and visual features have been concatenated to construct the multimodal features using which the decoder synthesizes the feedback. We have compared the results of the proposed system with the baseline models using quantitative and qualitative measures. The generated feedbacks have been analyzed using automatic and human evaluation. They have been found to be semantically similar to the ground-truth comments and relevant to the given text-image input. △ Less

Submitted 31 March, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

Comments: Submitted to ACM Transactions on Multimedia Computing, Communications, and Applications

arXiv:2105.14512 [pdf, other]

SHELBRS: Location Based Recommendation Services using Switchable Homomorphic Encryption

Authors: Mishel Jain, Priyanka Singh, Balasubramanian Raman

Abstract: Location-Based Recommendation Services (LBRS) has seen an unprecedented rise in its usage in recent years. LBRS facilitates a user by recommending services based on his location and past preferences. However, leveraging such services comes at a cost of compromising one's sensitive information like their shopping preferences, lodging places, food habits, recently visited places, etc. to the third-p… ▽ More Location-Based Recommendation Services (LBRS) has seen an unprecedented rise in its usage in recent years. LBRS facilitates a user by recommending services based on his location and past preferences. However, leveraging such services comes at a cost of compromising one's sensitive information like their shopping preferences, lodging places, food habits, recently visited places, etc. to the third-party servers. Losing such information could be crucial and threatens one's privacy. Nowadays, the privacy-aware society seeks solutions that can provide such services, with minimized risks. Recently, a few privacy-preserving recommendation services have been proposed that exploit the fully homomorphic encryption (FHE) properties to address the issue. Though, it reduced privacy risks but suffered from heavy computational overheads that ruled out their commercial applications. Here, we propose SHELBRS, a lightweight LBRS that is based on switchable homomorphic encryption (SHE), which will benefit the users as well as the service providers. A SHE exploits both the additive as well as the multiplicative homomorphic properties but with comparatively much lesser processing time as it's FHE counterpart. We evaluate the performance of our proposed scheme with the other state-of-the-art approaches without compromising security. △ Less

Submitted 30 May, 2021; originally announced May 2021.

Comments: 9 pages, 6 figures, 3 tables

arXiv:2103.12523 [pdf, other]

Region extraction based approach for cigarette usage classification using deep learning

Authors: Anshul Pundhir, Deepak Verma, Puneet Kumar, Balasubramanian Raman

Abstract: This paper has proposed a novel approach to classify the subjects' smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditional detection module based on Yolo-v3, which improves model's performance and reduces its complexity. As per the best of our knowledge, we are the first to work on this dataset. This dataset c… ▽ More This paper has proposed a novel approach to classify the subjects' smoking behavior by extracting relevant regions from a given image using deep learning. After the classification, we have proposed a conditional detection module based on Yolo-v3, which improves model's performance and reduces its complexity. As per the best of our knowledge, we are the first to work on this dataset. This dataset contains a total of 2,400 images that include smokers and non-smokers equally in various environmental settings. We have evaluated the proposed approach's performance using quantitative and qualitative measures, which confirms its effectiveness in challenging situations. The proposed approach has achieved a classification accuracy of 96.74% on this dataset. △ Less

Submitted 23 March, 2021; originally announced March 2021.

Comments: 5 pages, 16 figures. To appear in the proceedings of the 28th IEEE International Conference on Image Processing (IEEE - ICIP), September 19-22, 2021, Anchorage, Alaska, USA

arXiv:2011.08388 [pdf, other]

Interpretable Image Emotion Recognition: A Domain Adaptation Approach Using Facial Expressions

Authors: Puneet Kumar, Balasubramanian Raman

Abstract: This paper proposes a feature-based domain adaptation technique for identifying emotions in generic images, encompassing both facial and non-facial objects, as well as non-human components. This approach addresses the challenge of the limited availability of pre-trained models and well-annotated datasets for Image Emotion Recognition (IER). Initially, a deep-learning-based Facial Expression Recogn… ▽ More This paper proposes a feature-based domain adaptation technique for identifying emotions in generic images, encompassing both facial and non-facial objects, as well as non-human components. This approach addresses the challenge of the limited availability of pre-trained models and well-annotated datasets for Image Emotion Recognition (IER). Initially, a deep-learning-based Facial Expression Recognition (FER) system is developed, classifying facial images into discrete emotion classes. Maintaining the same network architecture, this FER system is then adapted to recognize emotions in generic images through the application of discrepancy loss, enabling the model to effectively learn IER features while classifying emotions into categories such as 'happy,' 'sad,' 'hate,' and 'anger.' Additionally, a novel interpretability method, Divide and Conquer based Shap (DnCShap), is introduced to elucidate the visual features most relevant for emotion recognition. The proposed IER system demonstrated emotion classification accuracies of 60.98% for the IAPSa dataset, 58.86% for the ArtPhoto dataset, 69.13% for the FI dataset, and 58.06% for the EMOTIC dataset. The system effectively identifies the important visual features leading to specific emotion classifications and provides detailed embedding plots to explain the predictions, enhancing the understanding and trust in AI-driven emotion recognition systems. △ Less

Submitted 28 August, 2024; v1 submitted 16 November, 2020; originally announced November 2020.

arXiv:2010.06200 [pdf, other]

End-to-end Triplet Loss based Emotion Embedding System for Speech Emotion Recognition

Authors: Puneet Kumar, Sidharth Jain, Balasubramanian Raman, Partha Pratim Roy, Masakazu Iwamura

Abstract: In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various lengths. The proposed system implements Residual Ne… ▽ More In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various lengths. The proposed system implements Residual Neural Network architecture. It is trained using softmax pre-training and triplet loss function. The weights between the fully connected and embedding layers of the trained network are used to calculate the embedding values. The embedding representations of various emotions are mapped onto a hyperplane, and the angles among them are computed using the cosine similarity. These angles are utilized to classify a new speech sample into its appropriate emotion class. The proposed system has demonstrated 91.67% and 64.44% accuracy while recognizing emotions for RAVDESS and IEMOCAP dataset, respectively. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: Accepted in ICPR 2020

arXiv:2009.02661 [pdf, other]

Computational Models for Academic Performance Estimation

Authors: Vipul Bansal, Himanshu Buckchash, Balasubramanian Raman

Abstract: Evaluation of students' performance for the completion of courses has been a major problem for both students and faculties during the work-from-home period in this COVID pandemic situation. To this end, this paper presents an in-depth analysis of deep learning and machine learning approaches for the formulation of an automated students' performance estimation system that works on partially availab… ▽ More Evaluation of students' performance for the completion of courses has been a major problem for both students and faculties during the work-from-home period in this COVID pandemic situation. To this end, this paper presents an in-depth analysis of deep learning and machine learning approaches for the formulation of an automated students' performance estimation system that works on partially available students' academic records. Our main contributions are (a) a large dataset with fifteen courses (shared publicly for academic research) (b) statistical analysis and ablations on the estimation problem for this dataset (c) predictive analysis through deep learning approaches and comparison with other arts and machine learning algorithms. Unlike previous approaches that rely on feature engineering or logical function deduction, our approach is fully data-driven and thus highly generic with better performance across different prediction tasks. △ Less

Submitted 6 September, 2020; originally announced September 2020.

Comments: 10 pages, 3 figures

arXiv:2007.05764 [pdf, ps, other]

Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis

Authors: Ankit Sharma, Puneet Kumar, Vikas Maddukuri, Nagasai Madamshettib, Kishore KG, Sahit Sai Sriram Kavurub, Balasubramanian Raman, Partha Pratim Roy

Abstract: The performance of text-to-speech (TTS) systems heavily depends on spectrogram to waveform generation, also known as the speech reconstruction phase. The time required for the same is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance the TTS systems for real-time applications such as digital assistants, mobile phones, embedd… ▽ More The performance of text-to-speech (TTS) systems heavily depends on spectrogram to waveform generation, also known as the speech reconstruction phase. The time required for the same is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance the TTS systems for real-time applications such as digital assistants, mobile phones, embedded devices, etc. The proposed approach applies Fast Griffin Lim Algorithm (FGLA) instead Griffin Lim algorithm (GLA) as vocoder in the speech synthesis phase. GLA and FGLA are both iterative, but the convergence rate of FGLA is faster than GLA. The proposed approach is tested on LJSpeech, Blizzard and Tatoeba datasets and the results for FGLA are compared against GLA and neural Generative Adversarial Network (GAN) based vocoder. The performance is evaluated based on synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay has been observed. The quality of the output speech has improved, which is advocated by higher Mean opinion scores (MOS) and faster convergence with FGLA as opposed to GLA. △ Less

Submitted 11 July, 2020; originally announced July 2020.

Comments: Accepted for publication in Springer Multimedia Tools and Applications Journal

arXiv:1909.12948 [pdf, ps, other]

doi 10.1145/3347712

Video Skimming: Taxonomy and Comprehensive Survey

Authors: Vivekraj V. K., Debashis Sen, Balasubramanian Raman

Abstract: Video skimming, also known as dynamic video summarization, generates a temporally abridged version of a given video. Skimming can be achieved by identifying significant components either in uni-modal or multi-modal features extracted from the video. Being dynamic in nature, video skimming, through temporal connectivity, allows better understanding of the video from its summary. Having this obvious… ▽ More Video skimming, also known as dynamic video summarization, generates a temporally abridged version of a given video. Skimming can be achieved by identifying significant components either in uni-modal or multi-modal features extracted from the video. Being dynamic in nature, video skimming, through temporal connectivity, allows better understanding of the video from its summary. Having this obvious advantage, recently, video skimming has drawn the focus of many researchers benefiting from the easy availability of the required computing resources. In this paper, we provide a comprehensive survey on video skimming focusing on the substantial amount of literature from the past decade. We present a taxonomy of video skimming approaches, and discuss their evolution highlighting key advances. We also provide a study on the components required for the evaluation of a video skimming performance. △ Less

Submitted 21 September, 2019; originally announced September 2019.

Journal ref: ACM Computing Surveys (CSUR), Volume 52, Issue 5, 2019

arXiv:1811.00936 [pdf, other]

Acoustic Features Fusion using Attentive Multi-channel Deep Architecture

Authors: Gaurav Bhatt, Akshita Gupta, Aditya Arora, Balasubramanian Raman

Abstract: In this paper, we present a novel deep fusion architecture for audio classification tasks. The multi-channel model presented is formed using deep convolution layers where different acoustic features are passed through each channel. To enable dissemination of information across the channels, we introduce attention feature maps that aid in the alignment of frames. The output of each channel is merge… ▽ More In this paper, we present a novel deep fusion architecture for audio classification tasks. The multi-channel model presented is formed using deep convolution layers where different acoustic features are passed through each channel. To enable dissemination of information across the channels, we introduce attention feature maps that aid in the alignment of frames. The output of each channel is merged using interaction parameters that non-linearly aggregate the representative features. Finally, we evaluate the performance of the proposed architecture on three benchmark datasets:- DCASE-2016 and LITIS Rouen (acoustic scene recognition), and CHiME-Home (tagging). Our experimental results suggest that the architecture presented outperforms the standard baselines and achieves outstanding performance on the task of acoustic scene recognition and audio tagging. △ Less

Submitted 2 November, 2018; originally announced November 2018.

Comments: Accepted in CHiME'18 (Interspeech Workshop)

arXiv:1801.06792 [pdf, other]

Attentive Recurrent Tensor Model for Community Question Answering

Authors: Gaurav Bhatt, Shivam Sharma, Balasubramanian Raman

Abstract: A major challenge to the problem of community question answering is the lexical and semantic gap between the sentence representations. Some solutions to minimize this gap includes the introduction of extra parameters to deep models or augmenting the external handcrafted features. In this paper, we propose a novel attentive recurrent tensor network for solving the lexical and semantic gap in commun… ▽ More A major challenge to the problem of community question answering is the lexical and semantic gap between the sentence representations. Some solutions to minimize this gap includes the introduction of extra parameters to deep models or augmenting the external handcrafted features. In this paper, we propose a novel attentive recurrent tensor network for solving the lexical and semantic gap in community question answering. We introduce token-level and phrase-level attention strategy that maps input sequences to the output using trainable parameters. Further, we use the tensor parameters to introduce a 3-way interaction between question, answer and external features in vector space. We introduce simplified tensor matrices with L2 regularization that results in smooth optimization during training. The proposed model achieves state-of-the-art performance on the task of answer sentence selection (TrecQA and WikiQA datasets) while outperforming the current state-of-the-art on the tasks of best answer selection (Yahoo! L4) and answer triggering task (WikiQA). △ Less

Submitted 21 January, 2018; originally announced January 2018.

arXiv:1712.03935 [pdf, other]

On the Benefit of Combining Neural, Statistical and External Features for Fake News Identification

Authors: Gaurav Bhatt, Aman Sharma, Shivam Sharma, Ankush Nagpal, Balasubramanian Raman, Ankush Mittal

Abstract: Identifying the veracity of a news article is an interesting problem while automating this process can be a challenging task. Detection of a news article as fake is still an open question as it is contingent on many factors which the current state-of-the-art models fail to incorporate. In this paper, we explore a subtask to fake news identification, and that is stance detection. Given a news artic… ▽ More Identifying the veracity of a news article is an interesting problem while automating this process can be a challenging task. Detection of a news article as fake is still an open question as it is contingent on many factors which the current state-of-the-art models fail to incorporate. In this paper, we explore a subtask to fake news identification, and that is stance detection. Given a news article, the task is to determine the relevance of the body and its claim. We present a novel idea that combines the neural, statistical and external features to provide an efficient solution to this problem. We compute the neural embedding from the deep recurrent model, statistical features from the weighted n-gram bag-of-words model and handcrafted external features with the help of feature engineering heuristics. Finally, using deep neural layer all the features are combined, thereby classifying the headline-body news pair as agree, disagree, discuss, or unrelated. We compare our proposed technique with the current state-of-the-art models on the fake news challenge dataset. Through extensive experiments, we find that the proposed model outperforms all the state-of-the-art techniques including the submissions to the fake news challenge. △ Less

Submitted 11 December, 2017; originally announced December 2017.

Comments: Source code available at - www.deeplearn-ai.com

arXiv:1711.00003 [pdf, other]

Common Representation Learning Using Step-based Correlation Multi-Modal CNN

Authors: Gaurav Bhatt, Piyush Jha, Balasubramanian Raman

Abstract: Deep learning techniques have been successfully used in learning a common representation for multi-view data, wherein the different modalities are projected onto a common subspace. In a broader perspective, the techniques used to investigate common representation learning falls under the categories of canonical correlation-based approaches and autoencoder based approaches. In this paper, we invest… ▽ More Deep learning techniques have been successfully used in learning a common representation for multi-view data, wherein the different modalities are projected onto a common subspace. In a broader perspective, the techniques used to investigate common representation learning falls under the categories of canonical correlation-based approaches and autoencoder based approaches. In this paper, we investigate the performance of deep autoencoder based methods on multi-view data. We propose a novel step-based correlation multi-modal CNN (CorrMCNN) which reconstructs one view of the data given the other while increasing the interaction between the representations at each hidden layer or every intermediate step. Finally, we evaluate the performance of the proposed model on two benchmark datasets - MNIST and XRMB. Through extensive experiments, we find that the proposed model achieves better performance than the current state-of-the-art techniques on joint common representation learning and transfer learning tasks. △ Less

Submitted 31 October, 2017; originally announced November 2017.

Comments: Accepted in Asian Conference of Pattern Recognition (ACPR-2017)

arXiv:0906.4415 [pdf]

doi 10.1109/IADCC.2009.4809134

Robust Watermarking in Multiresolution Walsh-Hadamard Transform

Authors: Gaurav Bhatnagar, Balasubramanian Raman

Abstract: In this paper, a newer version of Walsh-Hadamard Transform namely multiresolution Walsh-Hadamard Transform (MR-WHT) is proposed for images. Further, a robust watermarking scheme is proposed for copyright protection using MRWHT and singular value decomposition. The core idea of the proposed scheme is to decompose an image using MR-WHT and then middle singular values of high frequency sub-band at… ▽ More In this paper, a newer version of Walsh-Hadamard Transform namely multiresolution Walsh-Hadamard Transform (MR-WHT) is proposed for images. Further, a robust watermarking scheme is proposed for copyright protection using MRWHT and singular value decomposition. The core idea of the proposed scheme is to decompose an image using MR-WHT and then middle singular values of high frequency sub-band at the coarsest and the finest level are modified with the singular values of the watermark. Finally, a reliable watermark extraction scheme is developed for the extraction of the watermark from the distorted image. The experimental results show better visual imperceptibility and resiliency of the proposed scheme against intentional or un-intentional variety of attacks. △ Less

Submitted 24 June, 2009; originally announced June 2009.

Comments: 6 Pages, 16 Figure, 2 Tables

Journal ref: Proc. of IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 6-7 March 2009, pp. 894-899

Showing 1–25 of 25 results for author: Raman, B