Search | arXiv e-print repository

doi 10.1145/3626772.3661355

Synthetic Query Generation using Large Language Models for Virtual Assistants

Authors: Sonal Sannigrahi, Thiago Fraga-Silva, Youssef Oualil, Christophe Van Gysel

Abstract: Virtual Assistants (VAs) are important Information Retrieval platforms that help users accomplish various tasks through spoken commands. The speech recognition system (speech-to-text) uses query priors, trained solely on text, to distinguish between phonetically confusing alternatives. Hence, the generation of synthetic queries that are similar to existing VA usage can greatly improve upon the VA's abilities -- especially for use-cases that do not (yet) occur in paired audio/text data. In this paper, we provide a preliminary exploration of the use of Large Language Models (LLMs) to generate synthetic queries that are complementary to template-based methods. We investigate whether the methods (a) generate queries that are similar to randomly sampled, representative, and anonymized user queries from a popular VA, and (b) whether the generated queries are specific. We find that LLMs generate more verbose queries, compared to template-based methods, and reference aspects specific to the entity. The generated queries are similar to VA user queries, and are specific enough to retrieve the relevant entity. We conclude that queries generated by LLMs and templates are complementary.

Submitted 10 June, 2024; originally announced June 2024.

Comments: SIGIR '24. The 47th International ACM SIGIR Conference on Research & Development in Information Retrieval

arXiv:2305.03207 [pdf, other]

Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages

Authors: Sonal Sannigrahi, Rachel Bawden

Abstract: Multilingual language models have shown impressive cross-lingual transfer ability across a diverse set of languages and tasks. To improve the cross-lingual ability of these models, some strategies include transliteration and finer-grained segmentation into characters as opposed to subwords. In this work, we investigate lexical sharing in multilingual machine translation (MT) from Hindi, Gujarati, and Nepali into English. We explore the trade-offs that exist in translation performance between data sampling and vocabulary size, and we explore whether transliteration is useful in encouraging cross-script generalisation. We also verify how the different settings generalise to unseen languages (Marathi and Bengali). We find that transliteration does not give pronounced improvements, and our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences, even for relatively low-resource languages.

Submitted 4 May, 2023; originally announced May 2023.

Comments: EAMT main conference

arXiv:2304.14796 [pdf, other]

Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

Authors: Sonal Sannigrahi, Josef van Genabith, Cristina Espana-Bonet

Abstract: Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back on computing document embeddings based on sentence representations. Although there exist architectures and models to encode documents fully, they are in general limited to English and few other high-resourced languages. In this work, we provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare input token number truncation, sentence averaging as well as some simple windowing and in some cases new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average results in a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.

Submitted 28 April, 2023; originally announced April 2023.

Comments: EACL 2023 Findings paper, to present at LoResMT
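
The "simple sentence average" baseline named in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function name `average_sentence_embeddings` is hypothetical, and the toy vectors stand in for the output of a real sentence encoder such as LASER, LaBSE, or Sentence BERT.

```python
from typing import List


def average_sentence_embeddings(sentence_vectors: List[List[float]]) -> List[float]:
    """Return the element-wise mean of equally sized sentence vectors.

    Hypothetical helper: the mean vector serves as the document embedding.
    """
    if not sentence_vectors:
        raise ValueError("need at least one sentence vector")
    dim = len(sentence_vectors[0])
    if any(len(vec) != dim for vec in sentence_vectors):
        raise ValueError("all sentence vectors must have the same dimension")
    count = len(sentence_vectors)
    # Average each coordinate across all sentences in the document.
    return [sum(vec[i] for vec in sentence_vectors) / count for i in range(dim)]


# Toy 3-dimensional vectors (assumed values, not real encoder output).
doc_vector = average_sentence_embeddings([[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
```

An unweighted mean ignores sentence order and length, which may be one reason the abstract reports that more complex combinations are needed for semantic tasks.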

arXiv:2203.14632 [pdf, other]

Isomorphic Cross-lingual Embeddings for Low-Resource Languages

Authors: Sonal Sannigrahi, Jesse Read

Abstract: Cross-Lingual Word Embeddings (CLWEs) are a key component to transfer linguistic information learnt from higher-resource settings into lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficacy, and ability to work with minimal parallel resources. However, they crucially depend on the assumption of embedding spaces being approximately isomorphic, i.e. sharing similar geometric structure, which does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. In our work, we first pre-align the low-resource and related language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the related language and the target embedding space. Finally, we remap the pre-aligned low-resource space and the target space to generate the final CLWEs. We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity respectively, across several language pairs: {Nepali, Finnish, Romanian, Gujarati, Hungarian}-English. Lastly, our analysis also points to the relatedness as well as the amount of related language data available as being key factors in determining the quality of embeddings achieved.

Submitted 28 March, 2022; originally announced March 2022.

Comments: Accepted non-archival Repl4NLP, ACL 2022

arXiv:2106.03694 [pdf]

Detection of marine floating plastic using Sentinel-2 imagery and machine learning models

Authors: Srikanta Sannigrahi, Bidroha Basu, Arunima Sarkar Basu, Francesco Pilla

Abstract: The increasing level of marine plastic pollution poses severe threats to the marine ecosystem and biodiversity. The present study attempted to explore the full functionality of open Sentinel satellite data and ML models for detecting and classifying floating plastic debris in Mytilene (Greece), Limassol (Cyprus), Calabria (Italy), and Beirut (Lebanon). Two ML models, i.e. Support Vector Machine (SVM) and Random Forest (RF), were utilized to carry out the classification analysis. In-situ plastic location data was collected from the control experiment conducted in Mytilene, Greece and Limassol, Cyprus, and the same was considered for training the models. Both remote sensing bands and spectral indices were used for developing the ML models. A spectral signature profile for plastic was created for discriminating the floating plastic from other marine debris. A newly developed index, kernel Normalized Difference Vegetation Index (kNDVI), was incorporated into the modelling to examine its contribution to model performances. Both SVM and RF performed well in five models and test case combinations. Of the two ML models, RF achieved the highest performance. The inclusion of kNDVI was found effective and increased the model performances, reflected by high balanced accuracy measured for model 2 (~80% to ~98% for SVM and ~87% to ~97% for RF). Using the best-performing model, an automated floating plastic detection system was developed and tested in Calabria and Beirut. For both sites, the trained model detected the floating plastic with ~99% accuracy. Among the six predictors, the FDI was found to be the most important variable for detecting marine floating plastic. These findings collectively suggest that high-resolution remote sensing imagery and the automated ML models can be an effective alternative for the cost-effective detection of marine floating plastic.

Submitted 8 June, 2021; v1 submitted 27 May, 2021; originally announced June 2021.

Comments: 30 pages
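
As background for the kNDVI predictor mentioned in the abstract above, the sketch below follows the general definition of the index in the remote sensing literature, kNDVI = tanh(((NIR - red) / (2σ))²), which reduces to tanh(NDVI²) under the common choice σ = 0.5 (NIR + red). Whether the paper uses this exact σ is an assumption; the function names and reflectance values are hypothetical.

```python
import math
from typing import Optional


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        raise ValueError("nir + red must be non-zero")
    return (nir - red) / total


def kndvi(nir: float, red: float, sigma: Optional[float] = None) -> float:
    """Kernel NDVI: tanh(((nir - red) / (2 * sigma)) ** 2).

    Assumption: when sigma is not given, use sigma = 0.5 * (nir + red),
    in which case the result equals tanh(ndvi ** 2).
    """
    if sigma is None:
        sigma = 0.5 * (nir + red)
    if sigma == 0:
        raise ValueError("sigma must be non-zero")
    return math.tanh(((nir - red) / (2.0 * sigma)) ** 2)


# Hypothetical reflectance values for a single pixel.
pixel_kndvi = kndvi(nir=0.5, red=0.1)
```

For Sentinel-2 imagery, the red and near-infrared inputs would typically come from bands B4 and B8; the abstract does not state which bands the authors used.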

Showing 1–5 of 5 results for author: Sannigrahi, S