Showing 1–14 of 14 results for author: Delobelle, P

  1. arXiv:2408.04303

    cs.CL cs.LG

    Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

    Authors: François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

    Abstract: The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer strategy, trans-tokenization, designed to tackle this challenge and enable more efficient language adaptation. Our approach focuses on adapting a high-resource monolin…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: Accepted at COLM 2024

  2. arXiv:2408.02520

    cs.CL

    OneLove beyond the field -- A few-shot pipeline for topic and sentiment analysis during the FIFA World Cup in Qatar

    Authors: Christoph Rauchegger, Sonja Mei Wang, Pieter Delobelle

    Abstract: The FIFA World Cup in Qatar was discussed extensively in the news and on social media. Due to news reports with allegations of human rights violations, there were calls to boycott it. Wearing a OneLove armband was part of a planned protest activity. Controversy around the armband arose when FIFA threatened to sanction captains who wear it. To understand what topics Twitter users Tweeted about and…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Accepted at KONVENS 2024

  3. arXiv:2407.12824

    cs.CL cs.AI

    Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models

    Authors: Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, Pau Rodríguez

    Abstract: An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AUROC adaptation (AurA), an intervention tha…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: ICML 2024, 8 pages + appendix

  4. arXiv:2310.03477

    cs.CL cs.AI

    Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation

    Authors: François Remy, Pieter Delobelle, Bettina Berendt, Kris Demuynck, Thomas Demeester

    Abstract: Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting high-resources monolingual language models to a new target language. By generalizing over a word translation dictionary encompassing both the source and target langua…

    Submitted 5 October, 2023; originally announced October 2023.

    Comments: As first reviewed at TACL

  5. arXiv:2301.12855

    cs.CL

    How Far Can It Go?: On Intrinsic Gender Bias Mitigation for Text Classification

    Authors: Ewoenam Tokpo, Pieter Delobelle, Bettina Berendt, Toon Calders

    Abstract: To mitigate gender bias in contextualized language models, different intrinsic mitigation strategies have been proposed, alongside many bias metrics. Considering that the end use of these language models is for downstream tasks like text classification, it is important to understand how these intrinsic bias mitigation strategies actually translate to fairness in downstream tasks and the extent of…

    Submitted 30 January, 2023; originally announced January 2023.

  6. arXiv:2211.08192

    cs.CL cs.LG

    RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use

    Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt

    Abstract: Large transformer-based language models, e.g. BERT and GPT-3, outperform previous architectures on most natural language processing tasks. Such language models are first pre-trained on gigantic corpora of text and later used as base-model for finetuning on a particular task. Since the pre-training step is usually not repeated, base models are not up-to-date with the latest information. In this pap…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: 9 pages, 1 figure, 3 tables

  7. arXiv:2207.04546

    cs.CL cs.CY cs.LG

    FairDistillation: Mitigating Stereotyping in Language Models

    Authors: Pieter Delobelle, Bettina Berendt

    Abstract: Large pre-trained language models are successfully being used in a variety of tasks, across many languages. With this ever-increasing usage, the risk of harmful side effects also rises, for example by reproducing and reinforcing stereotypes. However, detecting and mitigating these harms is difficult to do in general and becomes computationally expensive when tackling multiple languages or when con…

    Submitted 16 September, 2022; v1 submitted 10 July, 2022; originally announced July 2022.

    Comments: Accepted at ECML-PKDD 2022

  8. arXiv:2204.13511

    cs.CL

    RobBERTje: a Distilled Dutch BERT Model

    Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt

    Abstract: Pre-trained large-scale language models such as BERT have gained a lot of attention thanks to their outstanding performance on a wide range of natural language tasks. However, due to their large number of parameters, they are resource-intensive both to deploy and to fine-tune. Researchers have created several methods for distilling language models into smaller ones to increase efficiency, with a s…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: Published in CLIN journal

    Journal ref: Computational Linguistics in the Netherlands Journal 2021

  9. arXiv:2112.07447

    cs.CL cs.CY cs.LG

    Measuring Fairness with Biased Rulers: A Survey on Quantifying Biases in Pretrained Language Models

    Authors: Pieter Delobelle, Ewoenam Kwaku Tokpo, Toon Calders, Bettina Berendt

    Abstract: An increasing awareness of biased patterns in natural language processing resources, like BERT, has motivated many metrics to quantify 'bias' and 'fairness'. But comparing the results of different metrics and the works that evaluate with such metrics remains difficult, if not outright impossible. We survey the existing literature on fairness metrics for pretrained language models and experimentall…

    Submitted 14 December, 2021; originally announced December 2021.

    Comments: 15 pages, 4 figures, 3 tables

  10. arXiv:2104.09947

    cs.CL cs.SI

    Measuring Shifts in Attitudes Towards COVID-19 Measures in Belgium Using Multilingual BERT

    Authors: Kristen Scott, Pieter Delobelle, Bettina Berendt

    Abstract: We classify seven months' worth of Belgian COVID-related Tweets using multilingual BERT and relate them to their governments' COVID measures. We classify Tweets by their stated opinion on Belgian government curfew measures (too strict, ok, too loose). We examine the change in topics discussed and views expressed over time and in reference to dates of related events such as implementation of new me…

    Submitted 20 April, 2021; originally announced April 2021.

    Comments: 5 pages, 2 figures

  11. arXiv:2010.13652

    cs.CL cs.AI

    Dutch Humor Detection by Generating Negative Examples

    Authors: Thomas Winters, Pieter Delobelle

    Abstract: Detecting if a text is humorous is a hard task to do computationally, as it usually requires linguistic and common sense insights. In machine learning, humor detection is usually modeled as a binary classification task, trained to predict if the given text is a joke or another type of text. Rather than using completely different non-humorous texts, we propose using text generation algorithms for i…

    Submitted 26 October, 2020; originally announced October 2020.

    Comments: Accepted at the Proceedings of the 32nd Benelux Conference on Artificial Intelligence (BNAIC 2020) and the 29th Belgian Dutch Conference on Machine Learning (Benelearn 2020)

    MSC Class: 68T50

    ACM Class: I.2.7; I.2.6

  12. arXiv:2005.06852

    cs.LG cs.AI stat.ML

    Ethical Adversaries: Towards Mitigating Unfairness with Adversarial Machine Learning

    Authors: Pieter Delobelle, Paul Temple, Gilles Perrouin, Benoît Frénay, Patrick Heymans, Bettina Berendt

    Abstract: Machine learning is being integrated into a growing number of critical systems with far-reaching impacts on society. Unexpected behaviour and unfair decision processes are coming under increasing scrutiny due to this widespread use and its theoretical considerations. Individuals, as well as organisations, notice, test, and criticize unfair results to hold model designers and deployers accountable.…

    Submitted 1 September, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

    Comments: 15 pages, 3 figures, 1 table

  13. arXiv:2001.06286

    cs.CL cs.LG

    RobBERT: a Dutch RoBERTa-based Language Model

    Authors: Pieter Delobelle, Thomas Winters, Bettina Berendt

    Abstract: Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT, which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies s…

    Submitted 16 September, 2020; v1 submitted 17 January, 2020; originally announced January 2020.

    Comments: 11 pages, 4 tables, 3 figures. Accepted in EMNLP Findings

  14. arXiv:1910.13793

    cs.CL

    Time to Take Emoji Seriously: They Vastly Improve Casual Conversational Models

    Authors: Pieter Delobelle, Bettina Berendt

    Abstract: Graphical emoji are ubiquitous in modern-day online conversations. So is a single thumbs-up emoji able to signify an agreement, without any words. We argue that the current state-of-the-art systems are ill-equipped to correctly interpret these emoji, especially in a conversational context. However, in a casual context, the benefits might be high: a better understanding of users' utterances and mor…

    Submitted 30 October, 2019; originally announced October 2019.

    Comments: Accepted at Benelearn 2019