
Showing 1–28 of 28 results for author: Madabushi, H T

Searching in archive cs.

  1. arXiv:2407.13297  [pdf, other]

    cs.CL

    SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Specialized lexicons are collections of words with associated constraints such as special definitions, specific roles, and intended target audiences. These constraints are necessary for content generation and documentation tasks (e.g., writing technical manuals or children's books), where the goal is to reduce the ambiguity of text content and increase its overall readability for a specific group…

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2407.03181  [pdf, other]

    cs.CL

    Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

    Authors: Haritz Puerto, Tilek Chubakov, Xiaodan Zhu, Harish Tayyar Madabushi, Iryna Gurevych

    Abstract: Requiring a Large Language Model to generate intermediary reasoning steps has been shown to be an effective way of boosting performance. In fact, it has been found that instruction tuning on these intermediary reasoning steps improves model performance. In this work, we present a novel method of further improving performance by requiring models to compare multiple reasoning chains before generatin…

    Submitted 3 July, 2024; originally announced July 2024.

  3. arXiv:2406.16167  [pdf, other]

    cs.CL

    FS-RAG: A Frame Semantics Based Approach for Improved Factual Accuracy in Large Language Models

    Authors: Harish Tayyar Madabushi

    Abstract: We present a novel extension to Retrieval Augmented Generation with the goal of mitigating factual inaccuracies in the output of large language models. Specifically, our method draws on the cognitive linguistic theory of frame semantics for the indexing and retrieval of factual information relevant to helping large language models answer queries. We conduct experiments to demonstrate the effective…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: program code and prompts available at https://github.com/H-TayyarMadabushi/A-Frame-Semantics-based-approach-for-Improved-Factual-Accuracy-in-Large-Language-Models

  4. arXiv:2403.11025  [pdf, other]

    cs.CL

    Pre-Trained Language Models Represent Some Geographic Populations Better Than Others

    Authors: Jonathan Dunn, Benjamin Adams, Harish Tayyar Madabushi

    Abstract: This paper measures the skew in how well two families of LLMs represent diverse geographic populations. A spatial probing task is used with geo-referenced corpora to measure the degree to which pre-trained language models from the OPT and BLOOM series represent diverse populations around the world. Results show that these models perform much better for some populations than others. In particular,…

    Submitted 16 March, 2024; originally announced March 2024.

  5. arXiv:2403.04872  [pdf, other]

    cs.CL

    Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

    Authors: Frances A. Laureano De Leon, Harish Tayyar Madabushi, Mark Lee

    Abstract: Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained Language Models hand…

    Submitted 7 May, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Accepted for publication at Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Data and code available at https://github.com/francesita/code-mixed-probes

  6. arXiv:2402.12593  [pdf, other]

    cs.CL

    Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation

    Authors: Joseph Marvin Imperial, Gail Forey, Harish Tayyar Madabushi

    Abstract: Domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. However, current works in controllable text generation have yet to explore using these standards as references for control. Towards this end, we introduce Standardize, a retrieval-style in-context le…

    Submitted 19 February, 2024; originally announced February 2024.

  7. arXiv:2401.07923  [pdf, other]

    cs.CL

    Word Boundary Information Isn't Useful for Encoder Language Models

    Authors: Edward Gow-Smith, Dylan Phelps, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: All existing transformer-based approaches to NLP using subword tokenisation algorithms encode whitespace (word boundary information) through the use of special space symbols (such as ## or _) forming part of tokens. These symbols have been shown to a) lead to reduced morphological validity of tokenisations, and b) give substantial vocabulary redundancy. As such, removing these symbols has been…

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Preprint

  8. arXiv:2309.05454  [pdf, other]

    cs.CL

    Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Readability metrics and standards such as Flesch Kincaid Grade Level (FKGL) and the Common European Framework of Reference for Languages (CEFR) exist to guide teachers and educators to properly assess the complexity of educational materials before administering them for classroom use. In this study, we select a diverse set of open and closed-source instruction-tuned language models and investigate…

    Submitted 3 November, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

    Comments: Final camera-ready for EMNLP GEM Workshop 2023

  9. arXiv:2309.01809  [pdf, other]

    cs.CL

    Are Emergent Abilities in Large Language Models just In-Context Learning?

    Authors: Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Iryna Gurevych

    Abstract: Large language models, comprising billions of parameters and pre-trained on extensive web-scale corpora, have been claimed to acquire certain capabilities without having been specifically trained on them. These capabilities, referred to as "emergent abilities," have been a driving force in discussions regarding the potentials and risks of language models. A key challenge in evaluating emergent abi…

    Submitted 15 July, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: Accepted to ACL 2024

  10. arXiv:2308.13315  [pdf]

    cs.CL

    Construction Grammar and Language Models

    Authors: Harish Tayyar Madabushi, Laurence Romain, Petar Milin, Dagmar Divjak

    Abstract: Recent progress in deep learning and natural language processing has given rise to powerful models that are primarily trained on a cloze-like task and show some evidence of having access to substantial linguistic information, including some constructional knowledge. This groundbreaking discovery presents an exciting opportunity for a synergistic relationship between computational methods and Const…

    Submitted 4 September, 2023; v1 submitted 25 August, 2023; originally announced August 2023.

    Comments: Accepted for publication in The Cambridge Handbook of Construction Grammar, edited by Mirjam Fried and Kiki Nikiforidou. To appear in 2024

  11. arXiv:2210.17301  [pdf, other]

    cs.CL

    Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5

    Authors: Irina Bigoulaeva, Rachneet Sachdeva, Harish Tayyar Madabushi, Aline Villavicencio, Iryna Gurevych

    Abstract: We compare sequential fine-tuning with a model for multi-task learning in the context where we are interested in boosting performance on two tasks, one of which depends on the other. We test these models on the FigLang2022 shared task which requires participants to predict language inference labels on figurative language along with corresponding textual explanations of the inference predictions. O…

    Submitted 31 October, 2022; originally announced October 2022.

    Comments: Accepted for publication in the Proceedings of the Second Workshop on Figurative Language Processing (colocated with EMNLP 2022). Code and models at https://github.com/Rachneet/cross-task-figurative-explanations

  12. arXiv:2206.04184  [pdf, other]

    cs.CL

    Abstraction not Memory: BERT and the English Article System

    Authors: Harish Tayyar Madabushi, Dagmar Divjak, Petar Milin

    Abstract: Article prediction is a task that has long defied accurate linguistic description. As such, this task is ideally suited to evaluate models on their ability to emulate native-speaker intuition. To this end, we compare the performance of native English speakers and pre-trained models on the task of article prediction set up as a three way choice (a/an, the, zero). Our experiments with BERT show that…

    Submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted for publication at 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022). Data and code available at https://github.com/H-TayyarMadabushi/Abstraction-not-Memory-BERT-and-the-English-Article-System-NAACL-2022

  13. arXiv:2205.11306  [pdf, ps, other]

    cs.CL

    Sample Efficient Approaches for Idiomaticity Detection

    Authors: Dylan Phelps, Xuan-Rui Fan, Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: Deep neural models, in particular Transformer-based pre-trained language models, require a significant amount of data to train. This need for data tends to lead to problems when dealing with idiomatic multiword expressions (MWEs), which are inherently less frequent in natural text. As such, this work explores sample efficient methods of idiomaticity detection. In particular we study the impact of…

    Submitted 23 May, 2022; originally announced May 2022.

  14. arXiv:2204.10050  [pdf, other]

    cs.CL

    SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding

    Authors: Harish Tayyar Madabushi, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, Aline Villavicencio

    Abstract: This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context. Each subtask inclu…

    Submitted 30 May, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

    Comments: Data available at https://github.com/H-TayyarMadabushi/SemEval_2022_Task2-idiomaticity and competition website at https://sites.google.com/view/semeval2022task2-idiomaticity

  15. arXiv:2204.05185  [pdf, other]

    cs.CL cs.LG

    Uniform Complexity for Text Generation

    Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

    Abstract: Large language models (LLMs) have shown promising results in a wide array of generative NLP tasks, such as summarization and machine translation. In the context of narrative generation, however, existing models still do not capture factors that contribute to producing consistent text. For instance, it is logical that a piece of text or a story should be uniformly readable throughout and that this…

    Submitted 19 October, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: Final camera-ready for EMNLP 2023

  16. arXiv:2204.04058  [pdf, other]

    cs.CL

    Improving Tokenisation by Alternative Treatment of Spaces

    Authors: Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

    Abstract: Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations of limited linguistic validity, and representing equivalent strings differently depending on their position within a word. We hypothesise that these problems hin…

    Submitted 22 October, 2022; v1 submitted 8 April, 2022; originally announced April 2022.

    Comments: EMNLP 2022

  17. arXiv:2110.05663  [pdf, other]

    cs.CL

    Learned Construction Grammars Converge Across Registers Given Increased Exposure

    Authors: Jonathan Dunn, Harish Tayyar Madabushi

    Abstract: This paper measures the impact of increased exposure on whether learned construction grammars converge onto shared representations when trained on data from different registers. Register influences the frequency of constructions, with some structures common in formal but not informal usage. We expect that a grammar induction algorithm exposed to different registers will acquire different construct…

    Submitted 11 October, 2021; originally announced October 2021.

  18. UoB at SemEval-2021 Task 5: Extending Pre-Trained Language Models to Include Task and Domain-Specific Information for Toxic Span Prediction

    Authors: Erik Yan, Harish Tayyar Madabushi

    Abstract: Toxicity is pervasive in social media and poses a major threat to the health of online communities. The recent introduction of pre-trained language models, which have achieved state-of-the-art results in many NLP tasks, has transformed the way in which we approach natural language processing. However, the inherent nature of pre-training means that they are unlikely to capture task-specific statist…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: Published in Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021); Code available at: https://github.com/erikdyan/toxic_span_detection

    Journal ref: 2021.semeval-1.28 (2021) 243-248

  19. arXiv:2109.04413  [pdf, other]

    cs.CL

    AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models

    Authors: Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, Aline Villavicencio

    Abstract: Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions al…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: Findings of EMNLP 2021. Code available at: https://github.com/H-TayyarMadabushi/AStitchInLanguageModels

  20. arXiv:2011.04134  [pdf, other]

    cs.CL

    CxGBERT: BERT meets Construction Grammar

    Authors: Harish Tayyar Madabushi, Laurence Romain, Dagmar Divjak, Petar Milin

    Abstract: While lexico-semantic elements no doubt capture a large amount of linguistic information, it has been argued that they do not capture all information contained in text. This assumption is central to constructionist approaches to language which argue that language consists of constructions, learned pairings of a form and a function or meaning that are either frequent or have a meaning that cannot b…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: 28th International Conference on Computational Linguistics (COLING 2020)

  21. arXiv:2010.09078  [pdf, other]

    cs.CL

    Incorporating Count-Based Features into Pre-Trained Models for Improved Stance Detection

    Authors: Anushka Prakash, Harish Tayyar Madabushi

    Abstract: The explosive growth and popularity of Social Media has revolutionised the way we communicate and collaborate. Unfortunately, this same ease of accessing and sharing information has led to an explosion of misinformation and propaganda. Given that stance detection can significantly aid in veracity prediction, this work focuses on boosting automated stance detection, a task on which pre-trained mode…

    Submitted 18 October, 2020; originally announced October 2020.

  22. arXiv:2010.09072  [pdf, other]

    cs.CL

    UoB at SemEval-2020 Task 1: Automatic Identification of Novel Word Senses

    Authors: Eleri Sarsfield, Harish Tayyar Madabushi

    Abstract: Much as the social landscape in which languages are spoken shifts, language too evolves to suit the needs of its users. Lexical semantic change analysis is a burgeoning field of semantic analysis which aims to trace changes in the meanings of words over time. This paper presents an approach to lexical semantic change detection based on Bayesian word sense induction suitable for novel word sense id…

    Submitted 18 October, 2020; originally announced October 2020.

  23. arXiv:2010.07988  [pdf, other]

    cs.CL

    CXP949 at WNUT-2020 Task 2: Extracting Informative COVID-19 Tweets -- RoBERTa Ensembles and The Continued Relevance of Handcrafted Features

    Authors: Calum Perrio, Harish Tayyar Madabushi

    Abstract: This paper presents our submission to Task 2 of the Workshop on Noisy User-generated Text. We explore improving the performance of a pre-trained transformer-based language model fine-tuned for text classification through an ensemble implementation that makes use of corpus level information and a handcrafted feature. We test the effectiveness of including the aforementioned features in accommodatin…

    Submitted 15 October, 2020; originally announced October 2020.

  24. arXiv:2008.08547  [pdf, ps, other]

    cs.CL

    UoB at SemEval-2020 Task 12: Boosting BERT with Corpus Level Information

    Authors: Wah Meng Lim, Harish Tayyar Madabushi

    Abstract: Pre-trained language model word representation, such as BERT, have been extremely successful in several Natural Language Processing tasks significantly improving on the state-of-the-art. This can largely be attributed to their ability to better capture semantic information contained within a sentence. Several tasks, however, can benefit from information available at a corpus level, such as Term Fr…

    Submitted 19 August, 2020; originally announced August 2020.

  25. arXiv:2006.04597  [pdf, ps, other]

    cs.CL cs.LG cs.NE

    CS-Embed at SemEval-2020 Task 9: The effectiveness of code-switched word embeddings for sentiment analysis

    Authors: Frances Adriana Laureano De Leon, Florimond Guéniat, Harish Tayyar Madabushi

    Abstract: The growing popularity and applications of sentiment analysis of social media posts has naturally led to sentiment analysis of posts written in multiple languages, a practice known as code-switching. While recent research into code-switched posts has focused on the use of multilingual word embeddings, these embeddings were not trained on code-switched data. In this work, we present word-embeddings…

    Submitted 7 September, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: Accepted at SemEval-2020, COLING

  26. arXiv:2003.11563  [pdf, other]

    cs.CL cs.LG stat.ML

    Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data

    Authors: Harish Tayyar Madabushi, Elena Kochkina, Michael Castelle

    Abstract: The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news document…

    Submitted 16 March, 2020; originally announced March 2020.

    Comments: NLP4IF 2019

  27. arXiv:2003.03813  [pdf]

    cs.CL cs.LG stat.ML

    Keeping it simple: Implementation and performance of the proto-principle of adaptation and learning in the language sciences

    Authors: Petar Milin, Harish Tayyar Madabushi, Michael Croucher, Dagmar Divjak

    Abstract: In this paper we present the Widrow-Hoff rule and its applications to language data. After contextualizing the rule historically and placing it in the chain of neurally inspired artificial learning models, we explain its rationale and implementational considerations. Using a number of case studies we illustrate how the Widrow-Hoff rule offers unexpected opportunities for the computational simulati…

    Submitted 28 August, 2021; v1 submitted 8 March, 2020; originally announced March 2020.

  28. arXiv:1908.05441  [pdf, other]

    cs.CL cs.AI

    Multi-class Hierarchical Question Classification for Multiple Choice Science Exams

    Authors: Dongfang Xu, Peter Jansen, Jaycie Martin, Zhengnan Xie, Vikas Yadav, Harish Tayyar Madabushi, Oyvind Tafjord, Peter Clark

    Abstract: Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately. However, developing strong QC algorithms has been hindered by the limited size and complexity of annotated data available. To address this, we present the largest challenge dataset for QC, containing 7,787 science exam questions paired with detailed class…

    Submitted 15 August, 2019; originally announced August 2019.