Search | arXiv e-print repository

Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo

Authors: Abhinaba Bala, Ashok Urlana, Rahul Mishra, Parameswari Krishnamurthy

Abstract: Obtaining sufficient information in one's mother tongue is crucial for satisfying the information needs of the users. While high-resource languages have abundant online resources, the situation is less than ideal for very low-resource languages. Moreover, the insufficient reporting of vital national and international events continues to be a worry, especially in languages with scarce resources, like Mizo. In this paper, we conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles, which leverages English-language news to supplement and enhance the information related to the corresponding news events. Furthermore, we make available 500 Mizo news articles and corresponding enriched holistic summaries. Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles. The Mizo dataset and code can be accessed at https://github.com/barvin04/mizo_enrichment

Submitted 25 April, 2024; originally announced May 2024.

Comments: Accepted at LREC-COLING2024 WILDRE Workshop

ACM Class: I.2.7

arXiv:2403.16592

TrustAI at SemEval-2024 Task 8: A Comprehensive Analysis of Multi-domain Machine Generated Text Detection Techniques

Authors: Ashok Urlana, Aditya Saibewar, Bala Mallikarjunarao Garlapati, Charaka Vinayak Kumar, Ajeet Kumar Singh, Srinivasa Rao Chalamala

Abstract: Large language models (LLMs) exhibit a remarkable ability to generate fluent content across a wide spectrum of user queries. However, this capability has raised concerns regarding misinformation and personal information leakage. In this paper, we present our methods for SemEval-2024 Task 8, aiming to detect machine-generated text across various domains in both monolingual and multilingual contexts. Our study comprehensively analyzes various methods to detect machine-generated text, including statistical, neural, and pre-trained model approaches. We also detail our experimental setup and perform an in-depth error analysis to evaluate the effectiveness of these methods. Our methods obtain an accuracy of 86.9% on the test set of subtask-A mono and 83.7% for subtask-B. Furthermore, we also highlight the challenges and essential factors for consideration in future studies.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 8 pages, 1 Figure

ACM Class: I.2.7

arXiv:2403.15529

LimGen: Probing the LLMs for Generating Suggestive Limitations of Research Papers

Authors: Abdur Rahman Bin Md Faizullah, Ashok Urlana, Rahul Mishra

Abstract: Examining limitations is a crucial step in the scholarly research reviewing process, revealing aspects where a study might lack decisiveness or require enhancement. This aids readers in considering broader implications for further research. In this article, we present a novel and challenging task of Suggestive Limitation Generation (SLG) for research papers. We compile a dataset called LimGen, encompassing 4068 research papers and their associated limitations from the ACL Anthology. We investigate several approaches to harness large language models (LLMs) for producing suggestive limitations, by thoroughly examining the related challenges, practical insights, and potential opportunities. Our LimGen dataset and code can be accessed at https://github.com/arbmf/LimGen.

Submitted 14 June, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

Comments: Accepted at ECML-PKDD 2024

arXiv:2402.14558

LLMs with Industrial Lens: Deciphering the Challenges and Prospects -- A Survey

Authors: Ashok Urlana, Charaka Vinayak Kumar, Ajeet Kumar Singh, Bala Mallikarjunarao Garlapati, Srinivasa Rao Chalamala, Rahul Mishra

Abstract: Large language models (LLMs) have become the secret ingredient driving numerous industrial applications, showcasing their remarkable versatility across a diverse spectrum of tasks. From natural language processing and sentiment analysis to content generation and personalized recommendations, their unparalleled adaptability has facilitated widespread adoption across industries. This transformative shift driven by LLMs underscores the need to explore the underlying associated challenges and avenues for enhancement in their utilization. In this paper, our objective is to unravel and evaluate the obstacles and opportunities inherent in leveraging LLMs within an industrial context. To this end, we conduct a survey involving a group of industry practitioners, develop four research questions derived from the insights gathered, and examine 68 industry papers to address these questions and derive meaningful conclusions.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 25 pages, 7 figures

arXiv:2312.14542

Automatic Data Retrieval for Cross Lingual Summarization

Authors: Nikhilesh Bhatnagar, Ashok Urlana, Vandan Mujadia, Pruthwik Mishra, Dipti Misra Sharma

Abstract: Cross-lingual summarization involves the summarization of text written in one language to a different one. There is a body of research addressing cross-lingual summarization from English to other European languages. In this work, we aim to perform cross-lingual summarization from English to Hindi. We propose that pairing up the coverage of newsworthy events in textual and video formats can prove helpful for data acquisition for cross-lingual summarization. We analyze the data and propose methods to match articles to video descriptions that serve as document and summary pairs. We also outline filtering methods over reasonable thresholds to ensure the correctness of the summaries. Further, we make available 28,583 mono and cross-lingual article-summary pairs at https://github.com/tingc9/Cross-Sum-News-Aligned. We also build and analyze multiple baselines on the collected data and report error analysis.

Submitted 22 December, 2023; originally announced December 2023.

Comments: 6 pages, 6 tables, 2 figures, conference: ICON 2023

arXiv:2311.09216

Assessing Translation capabilities of Large Language Models involving English and Indian Languages

Authors: Vandan Mujadia, Ashok Urlana, Yash Bhaskar, Penumalla Aditya Pavani, Kukkapalli Shravya, Parameswari Krishnamurthy, Dipti Misra Sharma

Abstract: Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the best performing large language model for the translation task involving LLMs, which is based on LLaMA. Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively, using 2-stage fine-tuned LLaMA-13b for English to Indian languages on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets. Similarly, for Indian languages to English, we achieved average BLEU scores of 14.03, 16.65, 16.17, 15.35, and 12.55 along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, respectively, using fine-tuned LLaMA-13b on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets. Overall, our findings highlight the potential and strength of large language models for machine translation capabilities, including for languages that are currently underrepresented in LLMs.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.09212

Controllable Text Summarization: Unraveling Challenges, Approaches, and Prospects -- A Survey

Authors: Ashok Urlana, Pruthwik Mishra, Tathagato Roy, Rahul Mishra

Abstract: Generic text summarization approaches often fail to address the specific intent and needs of individual users. Recently, scholarly attention has turned to the development of summarization methods that are more closely tailored and controlled to align with specific objectives and user needs. Despite a growing corpus of controllable summarization research, there is no comprehensive survey available that thoroughly explores the diverse controllable attributes employed in this context, delves into the associated challenges, and investigates the existing solutions. In this survey, we formalize the Controllable Text Summarization (CTS) task, categorize controllable attributes according to their shared characteristics and objectives, and present a thorough examination of existing datasets and methods within each category. Moreover, based on our findings, we uncover limitations and research gaps, while also exploring potential solutions and future directions for CTS. We release our detailed analysis of CTS papers at https://github.com/ashokurlana/controllable_text_summarization_survey.

Submitted 27 May, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: 21 pages, 6 figures, Accepted in ACL Findings 2024

ACM Class: I.2.7

arXiv:2305.08828

PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

Authors: Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay B. Cohen, Manish Shrivastava, Barry Haddow

Abstract: This paper introduces PMIndiaSum, a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus provides a training and testing ground covering four language families, 14 languages, and 196 language pairs, the largest number to date. We detail our construction workflow including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization by fine-tuning, prompting, as well as translate-and-summarize. Experimental results confirm the crucial role of our data in aiding summarization between Indian languages. Our dataset is publicly available and can be freely modified and re-distributed.

Submitted 19 October, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: Findings of EMNLP 2023

ACM Class: I.2.7

arXiv:2303.14461

Indian Language Summarization using Pretrained Sequence-to-Sequence Models

Authors: Ashok Urlana, Sahil Manoj Bhatt, Nirmal Surange, Manish Shrivastava

Abstract: The ILSUM shared task focuses on text summarization for two major Indian languages, Hindi and Gujarati, along with English. In this task, we experiment with various pretrained sequence-to-sequence models to find the best model for each of the languages. We present a detailed overview of the models and our approaches in this paper. We secure the first rank across all three sub-tasks (English, Hindi, and Gujarati). This paper also extensively analyzes the impact of k-fold cross-validation while experimenting with limited data size, and we also perform various experiments with a combination of the original and a filtered version of the data to determine the efficacy of the pretrained models.

Submitted 25 March, 2023; originally announced March 2023.

Comments: Accepted at FIRE-2022, Indian Language Summarization (ILSUM) track

arXiv:2209.02391

Butterflies: A new source of inspiration for futuristic aerial robotics

Authors: Chakravarthi Jada, Lokesh Ch. R. S, Ashok Urlana, Shridi Swamy Yerubandi, Kantha Rao Bora, Gouse Basha Shaik, Pavan Baswani, Balaraju Karri

Abstract: Nature is a habitat for an enormous number of species. All species perform complex activities using simple and elegant rules for their survival. The emergence of collective behaviour remarkably supports these activities. One form of collective behaviour is swarm intelligence, in which all agents possess the same rules and capabilities. This equality, along with local cooperation among the agents, leads to the achievement of global results. Swarm behaviours in nature include bird formations, fish school maneuvering, and ant movement. Recently, one school of research has studied these behaviours and proposed artificial paradigms such as Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), and Glowworm Swarm Optimization (GSO). Another school of research has used these models to design robotic platforms that detect (locate) multiple signal sources such as light, fire, plume, and odour. The Kinbots platform is one such recent experiment. In the same line of thought, this extended abstract presents the recently proposed butterfly-inspired metaphor, along with corresponding simulations and ongoing experiments with their outcomes.

Submitted 24 August, 2022; originally announced September 2022.

Comments: 2 pages, 3 figures, Accepted as Late Breaking Report at ICRA 2017

Showing 1–10 of 10 results for author: Urlana, A