Search | arXiv e-print repository

Retrieve, Generate, Evaluate: A Case Study for Medical Paraphrases Generation with Small Language Models

Authors: Ioana Buhnila, Aman Sinha, Mathieu Constant

Abstract: Recent surge in the accessibility of large language models (LLMs) to the general population can lead to untrackable use of such models for medical-related recommendations. Language generation via LLMs models has two key problems: firstly, they are prone to hallucination and therefore, for any medical purpose they require scientific and factual grounding; secondly, LLMs pose tremendous challenge to computational resources due to their gigantic model size. In this work, we introduce pRAGe, a pipeline for Retrieval Augmented Generation and evaluation of medical paraphrases generation using Small Language Models (SLM). We study the effectiveness of SLMs and the impact of external knowledge base for medical paraphrase generation in French.

Submitted 23 July, 2024; originally announced July 2024.

Comments: KnowledgeableLM 2024

arXiv:2407.12626 [pdf, other]

Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Authors: Aman Sinha, Timothee Mickus, Marianne Clausel, Mathieu Constant, Xavier Coubez

Abstract: The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission critical settings such as biomedical applications, other aspects also factor in, chief of which is a model's ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model's output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly.

Submitted 17 July, 2024; originally announced July 2024.

Comments: BioNLP 2024

arXiv:2309.08698 [pdf, other]

No Imputation Needed: A Switch Approach to Irregularly Sampled Time Series

Authors: Rohit Agarwal, Aman Sinha, Ayan Vishwakarma, Xavier Coubez, Marianne Clausel, Mathieu Constant, Alexander Horsch, Dilip K. Prasad

Abstract: Modeling irregularly-sampled time series (ISTS) is challenging because of missing values. Most existing methods focus on handling ISTS by converting irregularly sampled data into regularly sampled data via imputation. These models assume an underlying missing mechanism, which may lead to unwanted bias and sub-optimal performance. We present SLAN (Switch LSTM Aggregate Network), which utilizes a group of LSTMs to model ISTS without imputation, eliminating the assumption of any underlying process. It dynamically adapts its architecture on the fly based on the measured sensors using switches. SLAN exploits the irregularity information to explicitly capture each sensor's local summary and maintains a global summary state throughout the observational period. We demonstrate the efficacy of SLAN on two public datasets, namely, MIMIC-III, and Physionet 2012.

Submitted 19 August, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

arXiv:2206.03529 [pdf, other]

How to Dissect a Muppet: The Structure of Transformer Embedding Spaces

Authors: Timothee Mickus, Denis Paperno, Mathieu Constant

Abstract: Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.

Submitted 7 June, 2022; originally announced June 2022.

Comments: Accepted at TACL (pre-MIT Press publication version)

arXiv:2205.13858 [pdf, other]

Semeval-2022 Task 1: CODWOE -- Comparing Dictionaries and Word Embeddings

Authors: Timothee Mickus, Kees van Deemter, Mathieu Constant, Denis Paperno

Abstract: Word embeddings have advanced the state of the art in NLP across numerous tasks. Understanding the contents of dense neural representations is of utmost interest to the computational semantics community. We propose to focus on relating these opaque word vectors with human-readable definitions, as found in dictionaries. This problem naturally divides into two subtasks: converting definitions into embeddings, and converting embeddings into definitions. This task was conducted in a multilingual setting, using comparable sets of embeddings trained homogeneously.

Submitted 27 May, 2022; originally announced May 2022.

arXiv:2108.07708 [pdf, other]

A Game Interface to Study Semantic Grounding in Text-Based Models

Authors: Timothee Mickus, Mathieu Constant, Denis Paperno

Abstract: Can language models learn grounded representations from text distribution alone? This question is both central and recurrent in natural language processing; authors generally agree that grounding requires more than textual distribution. We propose to experimentally test this claim: if any two words have different meanings and yet cannot be distinguished from distribution alone, then grounding is out of the reach of text-based models. To that end, we present early work on an online game for the collection of human judgments on the distributional similarity of word pairs in five languages. We further report early results of our data collection campaign.

Submitted 17 August, 2021; originally announced August 2021.

arXiv:1911.05758 [pdf, other]

doi 10.7275/t778-ja71

What do you mean, BERT? Assessing BERT as a Distributional Semantics Model

Authors: Timothee Mickus, Denis Paperno, Mathieu Constant, Kees van Deemter

Abstract: Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.

Submitted 8 May, 2020; v1 submitted 13 November, 2019; originally announced November 2019.

Journal ref: Proceedings of the Society for Computation in Linguistics: Vol. 3 (2020), Article 34

arXiv:1911.05715 [pdf, other]

Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling

Authors: Timothee Mickus, Denis Paperno, Mathieu Constant

Abstract: Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations. Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it. We implement this approach in a Transformer-based sequence-to-sequence model. Our proposal allows to train contextualization and definition generation in an end-to-end fashion, which is a conceptual improvement over earlier works. We achieve state-of-the-art results both in contextual and non-contextual definition modeling.

Submitted 13 November, 2019; originally announced November 2019.

Journal ref: Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, 30 September, 2019, University of Turku, Turku, Finland

arXiv:1404.1872 [pdf]

Intégration des données d'un lexique syntaxique dans un analyseur syntaxique probabiliste

Authors: Anthony Sigogne, Matthieu Constant, Eric Laporte

Abstract: This article reports the evaluation of the integration of data from a syntactic-semantic lexicon, the Lexicon-Grammar of French, into a syntactic parser. We show that by changing the set of labels for verbs and predicational nouns, we can improve the performance on French of a non-lexicalized probabilistic parser.

Submitted 7 April, 2014; originally announced April 2014.

Comments: in French

Journal ref: Penser le Lexique-Grammaire. Perspectives actuelles, Fryni Kakoyianni-Doa (Ed.) (2014) 505-516

arXiv:1110.6631 [pdf, ps, other]

doi 10.1159/000339272

New first trimester crown-rump length's equations optimized by structured data collection from a French general population

Authors: Marc Constant, Viet Chi Tran, Bernard Benoît, Francis Vasseur

Abstract: --- Objectives --- Prior to foetal karyotyping, the likelihood of Down's syndrome is often determined by combining maternal age, serum free beta-HCG, PAPP-A levels and embryonic measurements of crown-rump length and nuchal translucency for gestational ages between 11 and 13 weeks. It appeared important to get a precise knowledge of these scan parameters' normal values during the first trimester. This paper focused on crown-rump length. --- Methods --- 402 pregnancies from in-vitro fertilization allowing a precise estimation of foetal ages (FA) were used to determine the best model that describes crown-rump length (CRL) as a function of FA. Scan measures by a single operator from 3846 spontaneous pregnancies representative of the general population from Northern France were used to build a mathematical model linking FA and CRL in a context as close as possible to normal scan screening used in Down's syndrome likelihood determination. We modeled both CRL as a function of FA and FA as a function of CRL. For this, we used a clear methodology and performed regressions with heteroskedastic corrections and robust regressions. The results were compared by cross-validation to retain the equations with the best predictive power. We also studied the errors between observed and predicted values. --- Results --- Data from 513 spontaneous pregnancies allowed us to model CRL as a function of foetal age. The best model was a polynomial of degree 2. Dating with our equation that models spontaneous pregnancies from a general population was in good agreement with objective datings obtained from 402 IVF pregnancies and thus supports the validity of our model. The most precise measure of CRL was when the SD was minimal (1.83 mm), for a CRL of 23.6 mm, where our model predicted a foetal age of 49.4 days. Our study allowed us to model the SD from 30 to 90 days of foetal age and offers the opportunity of using Z-scores in the future to detect growth abnormalities. --- Conclusion --- With powerful statistical tools we report a good modeling of the first trimester embryonic growth in the general population, allowing a better knowledge of the date of fertilization, useful in the ultrasound screening of Down's syndrome. The optimal period to measure CRL and predict foetal age was 49.4 days (9 weeks of gestational age). Our results open the way to the detection of foetal growth abnormalities using CRL Z-scores throughout the first trimester.

Submitted 30 October, 2011; originally announced October 2011.

Journal ref: Fetal Diagnosis and Therapy 32, 4 (2012) 277-287

arXiv:1005.5596 [pdf]

A generic tool to generate a lexicon for NLP from Lexicon-Grammar tables

Authors: Matthieu Constant, Elsa Tolone

Abstract: Lexicon-Grammar tables constitute a large-coverage syntactic lexicon but they cannot be directly used in Natural Language Processing (NLP) applications because they sometimes rely on implicit information. In this paper, we introduce LGExtract, a generic tool for generating a syntactic lexicon for NLP from the Lexicon-Grammar tables. It is based on a global table that contains undefined information and on a unique extraction script including all operations to be performed for all tables. We also present an experiment that has been conducted to generate a new lexicon of French verbs and predicative nouns.

Submitted 31 May, 2010; originally announced May 2010.

Journal ref: Actes du 27e Colloque international sur le lexique et la grammaire (L'Aquila, 10-13 septembre 2008). Seconde partie, Michele De Gioia (Ed.) (2010) pages 79-93

arXiv:0711.3691 [pdf, ps, other]

Outilex, plate-forme logicielle de traitement de textes écrits

Authors: Olivier Blanc, Matthieu Constant, Eric Laporte

Abstract: The Outilex software platform, which will be made available to research, development and industry, comprises software components implementing all the fundamental operations of written text processing: processing without lexicons, exploitation of lexicons and grammars, language resource management. All data are structured in XML formats, and also in more compact formats, either readable or binary, whenever necessary; the required format converters are included in the platform; the grammar formats allow for combining statistical approaches with resource-based approaches. Manually constructed lexicons for French and English, originating from the LADL, and of substantial coverage, will be distributed with the platform under LGPL-LR license.

Submitted 27 November, 2007; v1 submitted 23 November, 2007; originally announced November 2007.

Journal ref: Dans Verbum ex machina. Proceedings of TALN - Outilex, plate-forme logicielle de traitement de textes écrits, Louvain : Belgique (2006)

Showing 1–12 of 12 results for author: Constant, M