Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Aman Sinha♣,♥, Timothee Mickus♠, Marianne Clausel♣,
Mathieu Constant♣ and Xavier Coubez♥

♣Université de Lorraine, Nancy, France
♠University of Helsinki, Helsinki, Finland
♥Institut de Cancérologie, Strasbourg, France
Correspondence: [email protected]
Abstract

The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission critical settings such as biomedical applications, other aspects also factor in—chief of which is a model’s ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model’s output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly.



1 Introduction

Deep-learning models are trained with data-driven approaches to maximize prediction accuracy Goodfellow et al. (2016). This entails several well-documented pitfalls, ranging from closed-domain limitations Daume III and Marcu (2006) to social systemic biases McCoy et al. (2019); Schnabel et al. (2016). These limitations compound to a severe deterioration of model performances in out-of-domain (OOD) scenarios Hurd et al. (2013); Shah et al. (2020). This has led to engineering efforts towards developing models tailored to specific domains, ranging from the legal (Paul et al., 2023) to the biomedical (Lee et al., 2020; Singhal et al., 2023) ones.

Domain-specific models, while useful, are rarely considered a definitive answer. Crucially, in the biomedical domain, experts require more reliability from these models, in particular as far as accounting for uncertainty in prediction is concerned. For example, in the case of a risk scoring model used to rank patients for live transplant, uncertainty-awareness becomes critical. The lack of uncertainty-aware models may lead to improper allocation of medical resources Steyerberg et al. (2010). Such concerns exemplify the importance of uncertainty-aware models and their critical role in model selection.

The compatibility of domain-specific pretraining and uncertainty modeling appears under-assessed. To illustrate this, one can consider the entropy of output distributions: Domain-specific pretraining will lead to more probability mass assigned to a single (hopefully correct) estimate, leading to a lower entropy; whereas uncertainty-aware designs intend to not neglect valid alternatives—meaning that the probability mass should be spread out, which entails a higher entropy when uncertainty is due.
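To make the entropy argument concrete, the following minimal sketch compares the Shannon entropy of three output distributions over five classes. The distributions are hypothetical and chosen only for illustration; they are not taken from any model in this study.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical output distributions over five classes (illustrative values only).
peaked = [0.92, 0.02, 0.02, 0.02, 0.02]   # most mass on a single class
spread = [0.40, 0.35, 0.15, 0.05, 0.05]   # mass shared among plausible alternatives
uniform = [0.2] * 5                        # maximal uncertainty

h_peaked, h_spread, h_uniform = map(entropy, (peaked, spread, uniform))
print(f"peaked: {h_peaked:.3f}  spread: {h_spread:.3f}  uniform: {h_uniform:.3f}")
```

The peaked distribution has the lowest entropy and the uniform one reaches the maximum, $\log 5$, which is the tension described above: concentrating mass lowers entropy, while preserving alternatives raises it.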

Figure 1: Illustration of this study’s setup, shown as a 2×2 grid of panels crossing general-domain vs. domain-specific models (rows) with uncertainty-unaware vs. uncertainty-aware models (columns). We perform a systematic comparison of domain-specificity and uncertainty-awareness in the medical domain.

| Dataset | Lang. | Task description | Train | Val | Test | #Class | CIR | Avg. len | Max. len |
|---|---|---|---|---|---|---|---|---|---|
| MedABS | EN | Predict the patient condition described, given a medical abstract | 8662 | 2888 | 2888 | 5 | 3.1445 | 180.59 | 597 |
| MedNLI | EN | Predict the inference type, given a hypothesis and a premise | 11232 | 1395 | 1422 | 3 | 1 | 23.83 | 151 |
| SMOKING | EN | Predict the patient smoking status, given a medical discharge record | 398 | 100 | 104 | 5 | 23.75 | 654.30 | 2788 |
| PxSLU | FR | Predict the drug prescription intent, given a user speech transcription | 1386 | 198 | 397 | 4 | 98.1538 | 11.40 | 48 |
| MedMCQA | FR | Predict the number of answers, given a medical multi-choice question | 2171 | 312 | 622 | 5 | 21.1176 | 12.90 | 92 |
| MORFITT | FR | Predict the speciality, given a scientific article abstract | 1514 | 1022 | 1088 | 12 | 15.3529 | 226.33 | 1425 |

Table 1: Datasets description. CIR denotes class imbalance ratio.

In this work, we reflect on how domain-specificity and uncertainty-awareness articulate with one another. Figure 1 illustrates the experimental setup we use for our study. In practice, we study the performances of frequentist and Bayesian general and domain-specific models on biomedical text classification tasks across a wide array of metrics, ranging from macro F1 to SCE, with a specific focus on entropy Ruder and Plank (2017); Kuhn et al. (2023). More narrowly, we study the following research questions: RQ1: Are the benefits of uncertainty-awareness and domain-specificity orthogonal? RQ2: Given our benchmarking results, should medical practitioners prioritize domain-specificity or uncertainty-awareness?

2 Related Work

Recently, uncertainty quantification has gained attention from the NLP community Xiao and Wang (2019); Xiao et al. (2022); Hu et al. (2023), particularly in mission-critical settings such as the medical domain Hwang et al. (2023); Barandas et al. (2024). In parallel, compared to domain adaptation approaches Wiese et al. (2017) for the medical domain, there is a growing interest in domain-specific language models, from BioBERT Lee et al. (2020) to the recent Med-PaLM Singhal et al. (2023). Xiao et al. (2022) presented an extensive study of uncertainty quantification for general-domain PLMs. While uncertainty modeling has been applied to biomedical data previously (e.g., Begoli et al., 2019; Abdar et al., 2021), surprisingly little has been done for biomedical textual data. Our study therefore focuses precisely on the interaction between the two paradigms for medical-domain NLP tasks. We address this gap by focusing specifically on predictive entropy Ruder and Plank (2017); Kuhn et al. (2023).

3 Methodology

Datasets.

We conduct experiments on six standard biomedical datasets: three English datasets, viz. MedABS Schopf et al. (2023), MedNLI Romanov and Shivade and SMOKING Uzuner et al. (2008); as well as three French datasets, viz. MORFITT Labrak et al. (2023b), PxSLU Kocabiyikoglu et al. (2022) and MedMCQA Labrak et al. (2023a).

For MedABS, SMOKING, PxSLU, and MedMCQA, we do not perform any special preprocessing. For MedMCQA, we perform Task 2, i.e., predicting the number of possible responses (ranging from 1 to 5) for the input multiple-choice question. For MedNLI, we concatenate the premise and hypothesis using the [SEP] token and use the result as a single input, which casts the task as multi-class classification. For MORFITT, which is originally a multi-label classification task, we use the first label of each sample to convert it to a multi-class problem. The descriptive statistics of these datasets are listed in Table 1, along with the class imbalance ratio (CIR; Yu et al., 2022). See Appendix A.4 for more information.
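The two conversions described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, field names and sample contents are hypothetical, and in practice a tokenizer would typically insert the separator token itself when given a text pair.

```python
def first_label(labels):
    """Collapse a multi-label annotation to multi-class by keeping the first label."""
    if not labels:
        raise ValueError("sample has no label")
    return labels[0]

def join_nli_pair(premise, hypothesis, sep_token="[SEP]"):
    """Concatenate premise and hypothesis into one input string around the separator."""
    return f"{premise} {sep_token} {hypothesis}"

# Hypothetical samples: field names and contents are invented for illustration.
morfitt_like = {"abstract": "Étude rétrospective ...", "specialities": ["cardiology", "genetics"]}
nli_like = {"premise": "Patient denies chest pain.", "hypothesis": "The patient has chest pain."}

single_label = first_label(morfitt_like["specialities"])
nli_input = join_nli_pair(nli_like["premise"], nli_like["hypothesis"])
print(single_label)
print(nli_input)
```

Keeping only the first label discards the remaining annotations, which is a deliberate simplification that makes all six tasks comparable as single-label classification.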

Models.

We derive classifiers from language-specific PLMs: for English datasets, we use BERT Devlin et al. (2018) and BioBERT Lee et al. (2020); for French, we use CamemBERT Martin et al. (2019) and CamemBERT-bio Touchent et al. (2023). We compare two types of models: frequentist deep learning models (DNNs) and Bayesian deep learning models (BNNs). The DNN model comprises a PLM-based encoder, a Dropout unit and a one-layer classifier. The BNN models are likewise based on a PLM encoder, along with a Bayesian module applied over the classification layer. We experimented with MC-dropout Gal and Ghahramani (2016), DropConnect Mobiny et al. (2021), and variational inference Blundell et al. (2015) models. We focus on the DropConnect architecture, which comprises a PLM encoder along with a DropConnect dense classification layer. (Footnote 1: We justify our focus on DropConnect empirically, as it yielded the highest validation F1 scores on average in our case. See Sections A.1 and B for details. All main text results for uncertainty-aware classifiers pertain to DropConnect.) This approach infuses stochasticity into a deterministic model by randomly zeroing out classifier weights with probability $1-p$. This allows us to sample multiple outputs for a given input, which in turn lets us aggregate the predictions and produce estimates of uncertainty.
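As an illustration of how a DropConnect classification layer yields several outputs for one input, here is a minimal pure-Python sketch. It is not the implementation used in this study: the weights, features, keep probability and the inverted scaling by $1/p$ are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropconnect_forward(features, weights, bias, keep_prob, rng):
    """One stochastic pass: each weight is zeroed with probability 1 - keep_prob."""
    logits = []
    for row, b in zip(weights, bias):
        z = b
        for w, x in zip(row, features):
            if rng.random() < keep_prob:      # keep this weight with probability p
                z += (w / keep_prob) * x      # inverted scaling (assumption of this sketch)
        logits.append(z)
    return softmax(logits)

def predict_with_uncertainty(features, weights, bias, keep_prob=0.9, n_samples=3, seed=0):
    """Average n_samples stochastic passes; return the mean distribution and its entropy."""
    rng = random.Random(seed)
    samples = [dropconnect_forward(features, weights, bias, keep_prob, rng)
               for _ in range(n_samples)]
    mean = [sum(s[c] for s in samples) / n_samples for c in range(len(bias))]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy

# Hypothetical encoder output (4 features) and a 3-class classification layer.
features = [0.5, -1.2, 0.3, 2.0]
weights = [[0.2, -0.1, 0.4, 0.3], [-0.3, 0.5, 0.1, -0.2], [0.1, 0.1, -0.4, 0.2]]
bias = [0.0, 0.1, -0.1]
mean_probs, pred_entropy = predict_with_uncertainty(features, weights, bias)
print(mean_probs, pred_entropy)
```

Because the mask is resampled on every pass, the same input produces slightly different distributions; averaging them gives the prediction, and the entropy of the average is one possible uncertainty estimate.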

For simplicity, we note domain-specific models as $+\mathcal{D}$ (and general models $-\mathcal{D}$); uncertainty-aware models are referred to as $+\mathcal{U}$ (with frequentist models noted $-\mathcal{U}$). We replicate training across 10 seeds per model and dataset; further implementation details can be found in Section A.2.

Evaluation Setup.

We evaluate classifiers on two aspects: task performance and uncertainty awareness. For text classification, we report Macro-F1 and accuracy. For uncertainty quantification, we report Brier score (BS; Brier, 1950), Expected Calibration Error (ECE; Naeini et al., 2015), Static Calibration Error (SCE; Nixon et al., 2019), negative log-likelihood (NLL), coverage (Cov%) and entropy ($H$). See Section A.3 for definitions.
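For readers unfamiliar with these scores, the sketch below implements common textbook forms of three of them (multi-class Brier score, NLL, and ECE with equal-width confidence bins). These forms are assumptions and may differ in detail from the definitions in Section A.3; the predictions and labels are hypothetical.

```python
import math

def brier_score(probs, labels):
    """Mean over samples of the squared distance between probabilities and one-hot targets."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pc - (1.0 if c == y else 0.0)) ** 2 for c, pc in enumerate(p))
    return total / len(labels)

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the gold class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)

def expected_calibration_error(probs, labels, n_bins=10):
    """Bin by top-class confidence; average |accuracy - confidence| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += (len(b) / len(labels)) * abs(acc - avg_conf)
    return ece

# Hypothetical predictions over three classes and their gold labels.
probs = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
labels = [0, 1, 2, 1]
bs = brier_score(probs, labels)
nll = negative_log_likelihood(probs, labels)
ece = expected_calibration_error(probs, labels)
print(bs, nll, ece)
```

Lower is better for all three scores. ECE only looks at the top-class confidence, whereas SCE extends the same binning idea to every class.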

Figure 2: Performances for empirically best models (selected metrics), $z$-normalized per dataset. Panels: (a) Entropy; (b) Classification metrics; (c) Calibration metrics. See Table 5 in Appendix B for full non-normalized results.

4 Results

Figure 3: SHAP attributions. Panels: (a) F1; (b) Acc.; (c) NLL; (d) $H$; (e) Brier score; (f) SCE; (g) ECE; (h) Cov %. Variables are ordered by mean absolute SHAPs. In blue, weight assigned when the variable is negative; in red, when it is positive. ‘ds.’ denotes a categorical variable tracking the dataset.

Performance.

All results are listed in Table 5 in Appendix B; we highlight some key metrics in Figure 2. Insofar as classification metrics go, $+\mathcal{D}$ configurations outperform $-\mathcal{D}$ ones. More generally, as all scores are highly dependent on the exact dataset considered, we first de-trend them by $z$-normalizing results on a per-dataset basis to simplify analysis. We find $+\mathcal{D}+\mathcal{U}$ classifiers to be strong contenders, although they are often outperformed, especially by $+\mathcal{D}-\mathcal{U}$ models on classification metrics (Figure 2b) and by $-\mathcal{D}+\mathcal{U}$ models on calibration metrics (Figure 2c). As for entropy, we find both $+\mathcal{D}-\mathcal{U}$ and $+\mathcal{D}+\mathcal{U}$ to lead to lower scores. Trends are consistent across languages.
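Per-dataset $z$-normalization can be sketched as below. The scores are invented for illustration, and the use of the population standard deviation is an assumption, as the choice of estimator is not specified here.

```python
from collections import defaultdict
from statistics import mean, pstdev

def z_normalize_per_dataset(rows, metric):
    """Return a copy of rows in which `metric` is standardized within each dataset."""
    by_dataset = defaultdict(list)
    for row in rows:
        by_dataset[row["dataset"]].append(row[metric])
    stats = {ds: (mean(v), pstdev(v)) for ds, v in by_dataset.items()}
    out = []
    for row in rows:
        mu, sigma = stats[row["dataset"]]
        z = (row[metric] - mu) / sigma if sigma > 0 else 0.0
        out.append({**row, metric: z})
    return out

# Hypothetical F1 scores: the numbers are invented purely for illustration.
rows = [
    {"dataset": "A", "model": "-D-U", "f1": 0.60},
    {"dataset": "A", "model": "+D+U", "f1": 0.64},
    {"dataset": "B", "model": "-D-U", "f1": 0.85},
    {"dataset": "B", "model": "+D+U", "f1": 0.95},
]
normalized = z_normalize_per_dataset(rows, "f1")
for row in normalized:
    print(row)
```

After normalization, the two datasets are on the same scale even though their raw F1 ranges differ, which is what makes cross-dataset comparison of model types meaningful.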

Relative importance.

To interpret results in Figure 2 more rigorously, we rely on SHAP Lundberg and Lee (2017). SHAP is an algorithm to compute heuristics for Shapley values (Shapley, 1953), viz. a game-theoretical, additive and fair distribution of a given variable to be explained across predetermined factors of interest. Here, we analyze the scores obtained by individual classifiers on all 8 metrics, and try to attribute their values ($z$-normalized per dataset) to domain specificity ($\pm\mathcal{D}$), uncertainty awareness ($\pm\mathcal{U}$) and the dataset one observation corresponds to (ds.).
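To give an intuition for what Shapley values distribute, the sketch below computes them exactly for a toy three-factor game by averaging marginal contributions over all orderings. This is the underlying definition rather than the SHAP approximation, and the value function and its numbers are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_score(coalition):
    """Hypothetical (made-up) score reached when only the factors in `coalition` are on."""
    score = 0.0
    if "D" in coalition:
        score += 0.3
    if "U" in coalition:
        score += 0.1
    if "ds" in coalition:
        score += 1.0
    if "D" in coalition and "U" in coalition:
        score += 0.2   # interaction term, split evenly between D and U by symmetry
    return score

phi = shapley_values(["D", "U", "ds"], toy_score)
print(phi)
```

The attributions sum to the score of the full coalition (the additivity property), and the interaction between the two design factors is shared equally between them.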

Results are displayed in Figure 3; individual points correspond to the weight assigned to one of the factors for one of the datapoints, and factors are sorted from most to least impactful from top to bottom. Which of domain specificity and uncertainty awareness has the stronger impact depends strictly on the metric: cases where $\pm\mathcal{D}$ is assigned on average a greater absolute weight than $\pm\mathcal{U}$ account for exactly half of the metrics we study. Another important trend is that effects tied to $+\mathcal{D}$ are also often attested for $+\mathcal{U}$: if domain specificity is useful, then uncertainty awareness is as well. (Footnote 2: There are two notable exceptions: ECE and coverage, where we find $+\mathcal{D}$ to be detrimental. Variation across seeds might explain the discrepancy with Table 5.) Lastly, weights assigned to both $\pm\mathcal{D}$ and $\pm\mathcal{U}$ are considerably smaller than those assigned to datasets, showcasing that these trends are often overpowered by the specifics of the task at hand.

Figure 4: Entropy vs. probability mass assigned to the target ($z$-normalized per classifier). Orange: correct predictions; Blue: incorrect.

Entropy.

A desideratum we laid out above is to have large entropy scores when the model is incorrect. Focusing on entropy, we display how it compares to the probability mass assigned to the target in Figure 4. In detail, we retrieve all predictions for every datapoint across all classifiers and then $z$-normalize entropy scores and the probability assigned to the target class. (Footnote 3: When plotting entropy against probability mass assigned to the target class, we can keep in mind some useful points of reference. A perfect classifier that is always confidently correct should display a high probability mass and a low entropy (i.e., top left of our plot); what we hope to avoid is a confidently incorrect classifier (bottom left). As entropy and probability are statistically related, it is impossible to observe a high probability mass and a high entropy (top right). Lastly, assuming the classifier outputs continuous scores, this statistical dependency also dictates that probability mass and entropy be inversely correlated for correct predictions.) We can see that incorrect predictions do result in more spread-out entropy scores. Moreover, we can notice some tentative differences between the four types of classifiers of our study: correct predictions from $+\mathcal{D}+\mathcal{U}$ models seem to lead to an especially tight correlation between entropy and probability mass.

| Dataset | $f$: $-\mathcal{D}-\mathcal{U}$ | $f$: $+\mathcal{D}-\mathcal{U}$ | $f$: $-\mathcal{D}+\mathcal{U}$ | $f$: $+\mathcal{D}+\mathcal{U}$ | $\rho$: $-\mathcal{D}-\mathcal{U}$ | $\rho$: $+\mathcal{D}-\mathcal{U}$ | $\rho$: $-\mathcal{D}+\mathcal{U}$ | $\rho$: $+\mathcal{D}+\mathcal{U}$ |
|---|---|---|---|---|---|---|---|---|
| MedABS | 62.4587 | *64.8* | 62.3861 | **67.3204** | *-48.0* | -47.9223 | -44.6260 | **-53.5326** |
| MedNLI | 73.2472 | 73.2104 | *74.0* | **76.9906** | -73.1618 | *-77.4* | -76.0566 | **-83.2975** |
| SMOKING | **75.8140** | 71.5580 | 74.1591 | *74.8* | **-56.4635** | -37.9853 | -50.0037 | *-56.0* |
| PxSLU | 65.4190 | **87.2314** | 65.0581 | *85.8* | -85.4051 | -69.0819 | *-87.3* | **-96.1698** |
| MedMCQA | 65.5612 | 63.8467 | *66.6* | **68.1908** | **-82.2927** | *-82.2* | -60.7625 | -62.5836 |
| MORFITT | *65.6* | **66.1075** | 65.0232 | 64.7896 | *-54.6* | **-55.0504** | -50.7645 | -50.9648 |

Table 2: Statistical tests on entropy measurements (effect size $f$ and Spearman’s $\rho$), with best (bold) and second best (italics) highlighted.

However, establishing whether this difference is significant requires further testing. We therefore measure whether incorrect predictions lead to higher entropy in two ways: (i) using Mann–Whitney U-tests, from which we derive a common language effect size $f$ (as the entropy of incorrect predictions should be higher); and (ii) by computing Spearman correlation coefficients between the entropy and the mass assigned to the target class (as entropy should decrease with correctness). (Footnote 4: All U-tests suggest entropy for incorrect predictions is significantly higher, $p < 10^{-10}$.) Corresponding results are listed in Table 2: across most of the datasets we study, the top or second most coherent distributions we observe are for domain-specific and uncertainty-aware models. However, we also observe that actual performances are highly sensitive to the exact classification task at hand.
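The two statistics can be sketched in pure Python as follows. The common language effect size is computed directly from its pairwise definition, $f = U / (n_1 n_2)$, and Spearman's $\rho$ as a Pearson correlation on ranks; the $p$-value of the U-test is not computed here. All measurements are hypothetical.

```python
def common_language_effect_size(incorrect, correct):
    """f = U / (n1 * n2): share of (incorrect, correct) pairs in which the incorrect
    prediction has the higher entropy, counting ties as one half."""
    wins = 0.0
    for a in incorrect:
        for b in correct:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(incorrect) * len(correct))

def average_ranks(values):
    """1-based ranks, with tied values receiving the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-prediction measurements, invented for illustration.
entropy_incorrect = [1.2, 0.9, 1.4, 0.7]
entropy_correct = [0.2, 0.5, 0.8, 0.1, 0.3]
f = common_language_effect_size(entropy_incorrect, entropy_correct)

target_mass = [0.95, 0.80, 0.60, 0.40, 0.20]
entropy_scores = [0.15, 0.45, 0.90, 1.10, 1.30]
rho = spearman_rho(target_mass, entropy_scores)
print(f, rho)
```

An $f$ close to 1 means incorrect predictions almost always carry more entropy than correct ones, and a $\rho$ close to $-1$ means entropy falls steadily as the mass on the target class grows.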

5 Discussion & Conclusion

We can now answer our initial research questions.

RQ1: Are the benefits of uncertainty-awareness and domain-specificity orthogonal? We have seen in Table 2 that in most cases, using a classifier that was both domain-specific and uncertainty-aware led to the optimal distribution shape, with entropy more gracefully increasing with incorrectness.

RQ2: Should medical practitioners prioritize domain-specificity or uncertainty-awareness? SHAP attributions in Figure 3 strongly suggest that the evaluation metric dictates the strategy to follow. As one would expect, accuracy is better captured with domain-specific models, whereas uncertainty-aware models tend to be better calibrated.

We also found significant evidence throughout our experiments that the exact classification task at hand weighs in much more strongly than the design of the classifier. This extraneous factor necessarily complicates the relationship between domain-specificity and uncertainty-awareness: in a handful of cases in Figure 2, we observe classifiers that are neither uncertainty-aware nor domain-specific faring best among all the models we survey, and conversely domain-specific uncertainty-aware classifiers can also rank dead last. This is also related to the often limited quantitative difference between best and worst models, which for instance can be as low as $\pm 2.3\%$ for F1 on MedABS (cf. Table 5).

Overall, our experiments suggest a very nuanced conclusion. Domain-specificity and uncertainty-awareness do appear to shape classifiers’ distributions and their entropy in distinct but compatible ways, but they have a lesser impact than the task itself. Hence, while we can often combine uncertainty-awareness and domain-specificity, there are no out-of-the-box solutions, and optimal performances require careful application designs.

Acknowledgments

We thank Sami Virpioja for his comments on an early version of this work.

This work is supported by the ICT 2023 project “Uncertainty-aware neural language models” funded by the Academy of Finland (grant agreement № 345999).

References

  • Abdar et al. (2021) Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U. Rajendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. 2021. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297.
  • Barandas et al. (2024) Marília Barandas, Lorenzo Famiglini, Andrea Campagner, Duarte Folgado, Raquel Simão, Federico Cabitza, and Hugo Gamboa. 2024. Evaluation of uncertainty quantification methods in multi-label classification: A case study with automatic diagnosis of electrocardiogram. Information Fusion, 101:101978.
  • Begoli et al. (2019) Edmon Begoli, Tanmoy Bhattacharya, and Dimitri Kusnezov. 2019. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence, 1(1):20–23.
  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural network. In International conference on machine learning, pages 1613–1622. PMLR.
  • Brier (1950) Glenn W Brier. 1950. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3.
  • Daume III and Marcu (2006) Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of artificial Intelligence research, 26:101–126.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
  • Hu et al. (2023) Mengting Hu, Zhen Zhang, Shiwan Zhao, Minlie Huang, and Bingzhe Wu. 2023. Uncertainty in natural language processing: Sources, quantification, and applications. arXiv preprint arXiv:2306.04459.
  • Hurd et al. (2013) Michael D Hurd, Paco Martorell, Adeline Delavande, Kathleen J Mullen, and Kenneth M Langa. 2013. Monetary costs of dementia in the united states. New England Journal of Medicine, 368(14):1326–1334.
  • Hwang et al. (2023) Jinha Hwang, Carol Gudumotu, and Benyamin Ahmadnia. 2023. Uncertainty quantification of text classification in a multi-label setting for risk-sensitive systems. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 541–547.
  • Kocabiyikoglu et al. (2022) Alican Kocabiyikoglu, François Portet, Prudence Gibert, Hervé Blanchon, Jean-Marc Babouchkine, and Gaëtan Gavazzi. 2022. A spoken drug prescription dataset in french for spoken language understanding. In 13th Language Resources and Evaluation Conference (LREC 2022).
  • Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.
  • Labrak et al. (2023a) Yanis Labrak, Adrien Bazoge, Richard Dufour, Mickael Rouvier, Emmanuel Morin, Béatrice Daille, and Pierre-Antoine Gourraud. 2023a. Frenchmedmcqa: A french multiple-choice question answering dataset for medical domain. arXiv preprint arXiv:2304.04280.
  • Labrak et al. (2023b) Yanis Labrak, Mickael Rouvier, and Richard Dufour. 2023b. MORFITT : Un corpus multi-labels d’articles scientifiques français dans le domaine biomédical. In 18e Conférence en Recherche d’Information et Applications – 16e Rencontres Jeunes Chercheurs en RI – 30e Conférence sur le Traitement Automatique des Langues Naturelles – 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 66–70, Paris, France. ATALA.
  • Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
  • Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc.
  • Martin et al. (2019) Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de La Clergerie, Djamé Seddah, and Benoît Sagot. 2019. Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894.
  • McCoy et al. (2019) R Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.
  • Mobiny et al. (2021) Aryan Mobiny, Pengyu Yuan, Supratik K Moulik, Naveen Garg, Carol C Wu, and Hien Van Nguyen. 2021. Dropconnect is effective in modeling uncertainty of bayesian deep networks. Scientific reports, 11(1):5458.
  • Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. 2015. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29.
  • Nixon et al. (2019) Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. 2019. Measuring calibration in deep learning. In CVPR workshops, volume 2.
  • Paul et al. (2023) Shounak Paul, Arpan Mandal, Pawan Goyal, and Saptarshi Ghosh. 2023. Pre-trained language models for the legal domain: A case study on indian law. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, ICAIL ’23, page 187–196, New York, NY, USA. Association for Computing Machinery.
  • (25) Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain.
  • Ruder and Plank (2017) Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 372–382, Copenhagen, Denmark. Association for Computational Linguistics.
  • Schnabel et al. (2016) Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. In international conference on machine learning, pages 1670–1679. PMLR.
  • Schopf et al. (2023) Tim Schopf, Daniel Braun, and Florian Matthes. 2023. Evaluating unsupervised text classification: Zero-shot and similarity-based approaches. In Proceedings of the 2022 6th International Conference on Natural Language Processing and Information Retrieval, NLPIR ’22, page 6–15, New York, NY, USA. Association for Computing Machinery.
  • Shah et al. (2020) Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, and Praneeth Netrapalli. 2020. The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33:9573–9585.
  • Shapley (1953) Lloyd S Shapley. 1953. A value for n-person games. In Harold W. Kuhn and Albert W. Tucker, editors, Contributions to the Theory of Games II, pages 307–317. Princeton University Press, Princeton.
  • Singhal et al. (2023) K. Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather J. Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, S. Lachgar, P. A. Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arcas, Nenad Tomašev, Yun Liu, Renee C Wong, Christopher Semturs, Seyedeh Sara Mahdavi, Joëlle K. Barral, Dale R. Webster, Greg S Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, and Vivek Natarajan. 2023. Towards expert-level medical question answering with large language models. ArXiv, abs/2305.09617.
  • Steyerberg et al. (2010) Ewout W Steyerberg, Andrew J Vickers, Nancy R Cook, Thomas Gerds, Mithat Gonen, Nancy Obuchowski, Michael J Pencina, and Michael W Kattan. 2010. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 21(1):128–138.
  • Touchent et al. (2023) Rian Touchent, Laurent Romary, and Eric De La Clergerie. 2023. CamemBERT-bio : Un modèle de langue français savoureux et meilleur pour la santé. In 18e Conférence en Recherche d’Information et Applications – 16e Rencontres Jeunes Chercheurs en RI – 30e Conférence sur le Traitement Automatique des Langues Naturelles – 25e Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 323–334, Paris, France. ATALA.
  • Uzuner et al. (2008) Özlem Uzuner, Ira Goldstein, Yuan Luo, and Isaac Kohane. 2008. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association, 15(1):14–24.
  • Wiese et al. (2017) Georg Wiese, Dirk Weissenborn, and Mariana Neves. 2017. Neural domain adaptation for biomedical question answering. arXiv preprint arXiv:1706.03610.
  • Xiao and Wang (2019) Yijun Xiao and William Yang Wang. 2019. Quantifying uncertainties in natural language processing tasks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 7322–7329.
  • Xiao et al. (2022) Yuxin Xiao, Paul Pu Liang, Umang Bhatt, Willie Neiswanger, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2022. Uncertainty quantification with pre-trained language models: A large-scale empirical analysis. arXiv preprint arXiv:2210.04714.
  • Yu et al. (2022) Sihao Yu, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Zizhen Wang, and Xueqi Cheng. 2022. A re-balancing strategy for class-imbalanced classification based on instance difficulty. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 70–79.
Figure 5: Comparison of various BNN models for different datasets on classification task based on Macro-F1 on validation set.

Appendix A Experimental details

A.1 Supplementary Bayesian models

We include the details for two more Bayesian models: MC-dropout and variational inference. Note that for all the Bayesian models we sample K=3 predictions at inference and use the mean prediction.

MCDropout (MCD)

This model is based on a PLM encoder, similar to the main study models. The difference in this case is that stochastic Dropout is applied over the classification layer. MCD Gal and Ghahramani (2016) extends the use of Dropout to inference time, which makes it possible to sample $K$ models and thus make $K$ predictions. The final prediction of the classification model can be denoted as

$$\hat{y} = K^{-1} \sum_{k=1}^{K} f_k(x),$$

where $f_k$ denotes the $k$-th sampled model.

Variational inference (VI)

This model is based on a PLM encoder, similar to the main study models, with a variational inference dense layer as the classification layer. We use Bayes by Backprop Blundell et al. (2015) for the VI dense layer. It approximates the distribution of each weight with a Gaussian distribution $\mathcal{N}(\mu,\rho)$. The weight distributions are fitted using Monte Carlo estimates of the gradient. Finally, the predictions are computed using the predictive posterior distribution, by sampling $K$ weight instances and making one forward pass per set of weights, as for MCD.
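The sampling step at inference can be sketched as follows. This is a minimal illustration, not the layer used in this study: the softplus mapping from $\rho$ to a standard deviation follows the usual Bayes by Backprop formulation and is an assumption here, as are all parameter values.

```python
import math
import random

def softplus(rho):
    """Map an unconstrained parameter to a positive standard deviation."""
    return math.log1p(math.exp(rho))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_weights(mu, rho, rng):
    """Reparameterized draw w = mu + softplus(rho) * eps, with eps ~ N(0, 1)."""
    return [[m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu_row, rho_row)]
            for mu_row, rho_row in zip(mu, rho)]

def predictive_mean(features, mu, rho, n_samples=3, seed=0):
    """Average class probabilities over n_samples weight draws (one forward pass each)."""
    rng = random.Random(seed)
    mean = [0.0] * len(mu)
    for _ in range(n_samples):
        w = sample_weights(mu, rho, rng)
        probs = softmax([sum(wi * x for wi, x in zip(row, features)) for row in w])
        mean = [m + p / n_samples for m, p in zip(mean, probs)]
    return mean

# Hypothetical variational parameters for a 3-class layer over 2 features.
mu = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
rho = [[-3.0, -3.0], [-3.0, -3.0], [-3.0, -3.0]]
vi_probs = predictive_mean([1.0, 2.0], mu, rho)
print(vi_probs)
```

Unlike Dropout-style masks, each draw here perturbs every weight by Gaussian noise whose scale is learned per weight.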

A.2 Implementation details

We use keras-uncertainty models for implementing our BNN model backbones.

| Model | MedABS lr | MedABS E | MedNLI lr | MedNLI E | SMOKING lr | SMOKING E | PxSLU lr | PxSLU E | MedMCQA lr | MedMCQA E | MORFITT lr | MORFITT E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $-\mathcal{D}$ DNN | 1e-5 | 4 | 5e-6 | 12 | 1e-4 | 15 | 5e-6 | 15 | 5e-6 | 14 | 5e-5 | 15 |
| $-\mathcal{D}$ DC | 5e-6 | 7 | 1e-5 | 11 | 1e-5 | 15 | 5e-6 | 13 | 5e-6 | 15 | 5e-5 | 13 |
| $-\mathcal{D}$ MCD | 5e-5 | 5 | 5e-6 | 15 | 5e-5 | 15 | 1e-5 | 14 | 5e-6 | 11 | 5e-5 | 10 |
| $-\mathcal{D}$ VI | 5e-6 | 7 | 1e-5 | 14 | 5e-6 | 13 | 5e-6 | 14 | 1e-6 | 15 | 5e-5 | 13 |
| $+\mathcal{D}$ DNN | 1e-5 | 4 | 1e-5 | 14 | 5e-5 | 15 | 1e-5 | 15 | 1e-5 | 10 | 5e-5 | 15 |
| $+\mathcal{D}$ DC | 5e-5 | 3 | 1e-5 | 13 | 1e-4 | 13 | 1e-5 | 15 | 5e-6 | 15 | 5e-5 | 13 |
| $+\mathcal{D}$ MCD | 5e-5 | 3 | 5e-5 | 12 | 5e-5 | 10 | 1e-5 | 14 | 1e-5 | 15 | 5e-5 | 13 |
| $+\mathcal{D}$ VI | 1e-5 | 5 | 5e-6 | 13 | 5e-5 | 14 | 1e-5 | 14 | 1e-6 | 15 | 5e-5 | 5 |
Table 3: Best hyperparameters (learning rate lr and number of epochs E) for each model configuration and dataset pair. We denote both English and French domain-specific PLMs with $+\mathcal{D}$. The models DC, MCD, VI are from the $+\mathcal{U}$ set.

Hyperparameter Setting

In all cases, we fine-tune the PLM backbone for all downstream tasks with a maximum sequence length of 512 and a batch size of 50 sentences. We perform a grid search over the number of epochs in {3, 4, 5, ..., 15} and the learning rate (lr) in {1e-7, 5e-6, 1e-6, 5e-5, 1e-5, 5e-4, 1e-4}. We replicate training with 3 seeds for each hyperparameter configuration, select the configuration with the best validation F1, and replicate training with 7 more seeds for these optimal configurations, so as to obtain 10 models per dataset, PLM and architecture. We also select the main BNN model of the study by choosing the system with the best average rank across all six datasets, as displayed in Figure 5.
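The selection procedure can be sketched as follows. The grid values come from the text; everything else is an assumption made for illustration: the seed values, the use of the mean validation F1 over the 3 seeds as the selection criterion, and the `toy_validation_f1` function, which stands in for an actual fine-tuning run.

```python
import math
from itertools import product

EPOCHS = list(range(3, 16))  # {3, 4, ..., 15}
LEARNING_RATES = [1e-7, 5e-6, 1e-6, 5e-5, 1e-5, 5e-4, 1e-4]
SEARCH_SEEDS = [0, 1, 2]  # hypothetical seed values
GRID = list(product(EPOCHS, LEARNING_RATES))


def select_best(evaluate, grid=GRID, seeds=SEARCH_SEEDS):
    """Return the (epochs, lr) pair with the highest mean validation F1 over the seeds."""
    best_config, best_score = None, float("-inf")
    for epochs, lr in grid:
        score = sum(evaluate(epochs, lr, seed) for seed in seeds) / len(seeds)
        if score > best_score:
            best_config, best_score = (epochs, lr), score
    return best_config, best_score


def toy_validation_f1(epochs, lr, seed):
    """Made-up stand-in for one training run; peaks at 4 epochs and lr = 1e-5."""
    return 1.0 - 0.05 * abs(epochs - 4) - 0.1 * abs(math.log10(lr) + 5) - 0.001 * seed


best_config, best_score = select_best(toy_validation_f1)
```

The grid holds 13 x 7 = 91 configurations, so the search stage amounts to 273 training runs per dataset, PLM and architecture before the 7 additional seeds are trained for the selected configuration.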

We train all models with binary cross-entropy loss and the Adam optimizer with $\epsilon = 10^{-8}$ and $\beta = (0.9, 0.999)$. For all BNN models, we obtain 3 sets of predictions after training the models to calculate the mean class probabilities. The corresponding optimal hyperparameters are listed in Table 3.

A.3 Calibration metrics definition

In what follows, $N$ denotes the number of samples in the test set and $C$ denotes the number of classes. Lower values of the Brier score, ECE, SCE, NLL and entropy, and higher values of coverage, indicate a better uncertainty-aware model.

Brier score.

Brier (1950) proposed BS which computes the mean square difference between the true classes and the predicted probabilities.

$$\text{BS} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( y^{(i)}_{c} - \hat{y}^{(i)}_{c} \right)^{2}$$
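As a small illustration of the Brier score, the sketch below computes it from one-hot targets and predicted class probabilities. The function name and the toy inputs are ours, chosen for the example.

```python
def brier_score(y_true, y_prob):
    """Mean over samples of the squared distance between one-hot targets and probabilities."""
    n = len(y_true)
    total = 0.0
    for target, probs in zip(y_true, y_prob):
        total += sum((t - p) ** 2 for t, p in zip(target, probs))
    return total / n


perfect = brier_score([[1, 0], [0, 1]], [[1.0, 0.0], [0.0, 1.0]])
uninformed = brier_score([[1, 0]], [[0.5, 0.5]])
```

A perfectly confident and correct model scores 0, while a uniform prediction over two classes scores 0.5.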

Expected Calibration Error.

Naeini et al. (2015) proposed ECE, a weighted average of the absolute difference between accuracy and confidence across $B$ bins.

$$\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|$$

where $n_b$ is the number of predictions in bin $b$, and acc$(b)$ and conf$(b)$ are the average accuracy and confidence of predictions in bin $b$, respectively. We set $B = 15$ in our experiments.
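The binning behind ECE can be sketched as follows. The sketch assumes equal-width confidence bins over [0, 1] and takes the confidence of a prediction to be its highest class probability; the binning scheme is not spelled out in the text, so both choices are assumptions.

```python
def expected_calibration_error(labels, probs, n_bins=15):
    """ECE with equal-width bins over the top-class confidence."""
    n = len(labels)
    counts = [0] * n_bins
    correct = [0.0] * n_bins
    conf_sum = [0.0] * n_bins
    for y, p in zip(labels, probs):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes to the last bin
        counts[b] += 1
        correct[b] += 1.0 if pred == y else 0.0
        conf_sum[b] += conf
    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        acc = correct[b] / counts[b]
        avg_conf = conf_sum[b] / counts[b]
        ece += counts[b] / n * abs(acc - avg_conf)
    return ece


ece_calibrated = expected_calibration_error([0, 1], [[1.0, 0.0], [0.0, 1.0]])
ece_underconfident = expected_calibration_error([0, 0], [[0.8, 0.2], [0.8, 0.2]])
```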

Static Calibration Error.

Nixon et al. (2019) proposed SCE, an extension of ECE to multi-class problems, to overcome its dependence on the number of bins.

$$\text{SCE} = \sum_{c=1}^{C} \sum_{b=1}^{B} \frac{n_b}{NC} \left| \text{acc}(b) - \text{conf}(b) \right|$$

We set $B = 15$ in our experiments.
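A hedged sketch of SCE follows. It reads the bin statistics as class-specific, as in the original definition by Nixon et al. (2019): for each class, samples are binned by the probability assigned to that class, accuracy is the fraction of samples in the bin whose label is that class, and confidence is the mean probability assigned to it. This reading, and the equal-width bins, are assumptions on our part.

```python
def static_calibration_error(labels, probs, n_bins=15):
    """SCE: per-class binned calibration error, averaged over classes."""
    n = len(labels)
    n_classes = len(probs[0])
    total = 0.0
    for c in range(n_classes):
        counts = [0] * n_bins
        hits = [0.0] * n_bins
        conf_sum = [0.0] * n_bins
        for y, p in zip(labels, probs):
            b = min(int(p[c] * n_bins), n_bins - 1)
            counts[b] += 1
            hits[b] += 1.0 if y == c else 0.0
            conf_sum[b] += p[c]
        for b in range(n_bins):
            if counts[b] == 0:
                continue  # empty bins contribute nothing
            acc = hits[b] / counts[b]
            avg_conf = conf_sum[b] / counts[b]
            total += counts[b] / n * abs(acc - avg_conf)
    return total / n_classes


sce_calibrated = static_calibration_error([0, 1], [[1.0, 0.0], [0.0, 1.0]])
sce_underconfident = static_calibration_error([0, 0], [[0.8, 0.2], [0.8, 0.2]])
```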

Negative Log Likelihood.

NLL serves as the primary training objective for neural networks in classification tasks. This loss function can also double as an effective metric for assessing uncertainty.

$$\text{NLL} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)$$
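For integer class labels, the sum reduces to the negative log-probability assigned to the correct class of each sample. The sketch below follows the formula as a sum over samples; the small `eps` floor that guards against log(0) is our addition and not part of the definition.

```python
import math


def negative_log_likelihood(labels, probs, eps=1e-12):
    """Sum over samples of -log of the probability given to the true class."""
    return -sum(math.log(max(p[y], eps)) for y, p in zip(labels, probs))


nll_confident = negative_log_likelihood([0, 1], [[1.0, 0.0], [0.0, 1.0]])
nll_uniform = negative_log_likelihood([0], [[0.5, 0.5]])
```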

Coverage Percentage.

The fraction of samples for which the correct class is indeed contained within the prediction set.
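Given prediction sets that have already been built, coverage is a simple ratio. The sketch below takes the sets as input and makes no assumption about how they are constructed, since that construction is not described in this appendix.

```python
def coverage(labels, prediction_sets):
    """Fraction of samples whose true label belongs to its prediction set."""
    hits = sum(1 for y, pred_set in zip(labels, prediction_sets) if y in pred_set)
    return hits / len(labels)


cov = coverage([0, 1, 2], [{0}, {0, 2}, {1, 2}])
```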

Shannon Entropy.

Shannon entropy quantifies the expected uncertainty inherent in the possible outcomes of a discrete random variable.

$$H = -\sum_{i=1}^{N} p_i \log(p_i)$$
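The sketch below computes the entropy of a single predicted distribution, reading the sum as running over the possible outcomes (here, the classes), which is our interpretation of the index. Natural logarithms are assumed, and zero-probability outcomes are skipped by the usual convention that 0 log 0 = 0.

```python
import math


def shannon_entropy(p):
    """Entropy in nats of one discrete probability distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


h_uniform = shannon_entropy([0.2] * 5)
h_certain = shannon_entropy([1.0, 0.0])
```

Entropy is maximal for a uniform distribution, where it equals the logarithm of the number of outcomes, and it is zero when all the mass sits on a single outcome.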

A.4 Dataset

We provide supplementary details about each dataset we used in Table 4.

| Dataset | Sample | Classes | Label Distribution |
|---|---|---|---|
| MedABS Schopf et al. (2023) | {text: "Catheterization of coronary artery bypass graft from the descending aorta. The increasing frequency of reoperation for coronary artery disease has led to the use of a variety of grafts. This report describes the catheter technique for selective opacification of a saphenous vein graft from the descending thoracic aorta to the posterior coronary circulation. ", label: "Cardiovascular diseases" } | {’Neoplasms’, ’Digestive system’, ’Nervous system’, ’Cardiovascular’, ’General pathological’ } | [1925 913 1149 1804 2871] |
| MedNLI Romanov and Shivade | {text: "No history of blood clots or DVTs, has never had chest pain prior to one week ago. [SEP] Patient has angina", label: "entailment"} | {"entailment", "contradict", "neutral" } | [3744 3744 3744] |
| SMOKING Uzuner et al. (2008) | {text: "071962960 bh 4236518 417454 12/10/2001 12:00:00 am discharge summary unsigned dis report status : unsigned discharge summary name : sterpsap , ny unit number : 582-96-88 admission date : 12/10/2001 discharge date : 12/19/2001 principal diagnosis : prosthetic aortic valve dysfunction associated diagnoses : aortic valve insufficiency bacterial endocarditis , active principal procedure : urgent re-do aortic valve replacement and correction of left ventricular to aortic discontinuity . ( 12/13/2001 ) other procedures : aortic root aortogram ( 12/12/2001 ) cardiac ultrasound ( 12/13/2001 ) insertion dual chamber pacemaker ( 12/15/2001 ) picc line placement ( 12/18/2001 ) history and reason for hospitalization : mr. sterpsap …", label: "CURRENT SMOKER"} | {’CURRENT SMOKER’, ’NON-SMOKER’, ’PAST SMOKER’, ’SMOKER’, ’UNKNOWN’ } | [ 27 49 24 8 190] |
| MEDMCQA Labrak et al. (2023a) | {text: "ans la liste suivante, quels sont les antibiotiques utilisables pour traiter une salmonellose chez un adulté?", label: 2} | {1,2,3,4,5} | [595 528 718 296 34] |
| MORFITT Labrak et al. (2023b) | {text: "La survenue de complications postopératoires représente un cauchemar (bien réel), tant pour le patient que pour son chirurgien. Dès lors, quoi de plus fantasmagorique que d’administrer une « potion magique » au patient avant l’intervention pour éliminer ce risque ? Le but de cet article est de résumer l’état des connaissances actuelles concernant les bénéfices potentiels, liés à l’administration d’immunonutrition aux patients traités pour cancer urologique…..", original_label: [ "Immunologie","Chirurgie",], label: "Immunologie"} | {’Vétérinaire’, ’Étiologie’, ’Psychologie’, ’Chirurgie’, ’Génétique’, ’Physiologie’, ’Pharmacologie’, ’Microbiologie’, ’Immunologie’, ’Chimie’, ’Virologie’, ’Parasitologie’ } | [ 82 261 32 122 40 17 152 39 242 185 104 238] |
| PxSLU Kocabiyikoglu et al. (2022) | {text: "antacapone 200 milligrammes 2 comprimés le matin 1 comprimé à midi 2 comprimé le soir traitement pour une durée totale de 4 semaines", label: "medical_prescription"} | {"medical_prescription", "negate","replace", "none" } | [1276 15 82 13] |
Table 4: Sample data from each Dataset
Classification metrics: Macro-F1, Accuracy. Uncertainty metrics: BS, ECE, SCE, NLL, Cov%, Entropy.

| Dataset | PLM | Model | Macro-F1 (↑) | Accuracy (↑) | BS (↓) | ECE (↓) | SCE (↓) | NLL (↓) | Cov% (↑) | Entropy (↓) |
|---|---|---|---|---|---|---|---|---|---|---|
| MedABS (US) | $-\mathcal{D}$ | DNN | 60.3633±0.003 | 60.9765±0.002 | 0.5535±0.008 | 0.1387±0.016 | 0.0683±0.004 | 1.3261±0.001 | 0.8976±0.013 | 1.5579±0.002 |
| MedABS (US) | $-\mathcal{D}$ | DC | 60.9855±0.004 | 61.1842±0.003 | 0.5518±0.002 | 0.1342±0.007 | 0.0674±0.003 | 1.3192±0.002 | 0.9611±0.003 | 1.5556±0.001 |
| MedABS (US) | $-\mathcal{D}$ | MCD | 60.6979±0.004 | 60.0993±0.006 | 0.5691±0.015 | 0.1503±0.014 | 0.0688±0.01 | 1.3235±0.008 | 0.9401±0.013 | 1.5542±0.002 |
| MedABS (US) | $-\mathcal{D}$ | VI | 60.8725±0.001 | 61.1611±0.001 | 0.5531±0.006 | 0.1394±0.004 | 0.0695±0.003 | 1.3164±0.003 | 0.958±0.001 | 1.5541±0.001 |
| MedABS (US) | $+\mathcal{D}$ | DNN | 60.8077±0.013 | 61.3343±0.01 | 0.5499±0.014 | 0.1448±0.005 | 0.0695±0.001 | 1.3201±0.014 | 0.9193±0.005 | 1.5561±0.003 |
| MedABS (US) | $+\mathcal{D}$ | DC | 62.5642±0.009 | 62.1018±0.01 | 0.5243±0.015 | 0.1381±0.016 | 0.0624±0.007 | 1.2962±0.007 | 0.9597±0.008 | 1.5523±0.002 |
| MedABS (US) | $+\mathcal{D}$ | MCD | 62.2038±0.022 | 62.1307±0.022 | 0.5226±0.031 | 0.1238±0.031 | 0.0593±0.015 | 1.3056±0.013 | 0.9666±0.01 | 1.5562±0.002 |
| MedABS (US) | $+\mathcal{D}$ | VI | 63.1893±0.004 | 63.1694±0.003 | 0.5234±0.009 | 0.1464±0.01 | 0.0653±0.003 | 1.288±0.006 | 0.9603±0.005 | 1.5491±0.002 |
| MedNLI (US) | $-\mathcal{D}$ | DNN | 73.8951±0.013 | 73.8397±0.015 | 0.3976±0.006 | 0.1278±0.02 | 0.0846±0.012 | 0.8177±0.015 | 0.9119±0.008 | 1.0156±0.008 |
| MedNLI (US) | $-\mathcal{D}$ | DC | 74.8161±0.019 | 74.8711±0.018 | 0.4242±0.021 | 0.185±0.007 | 0.1259±0.005 | 0.7945±0.014 | 0.8509±0.007 | 0.9941±0.002 |
| MedNLI (US) | $-\mathcal{D}$ | MCD | 72.8896±0.03 | 73.0192±0.03 | 0.4163±0.009 | 0.1214±0.049 | 0.0865±0.03 | 0.8298±0.037 | 0.9109±0.04 | 1.0171±0.02 |
| MedNLI (US) | $-\mathcal{D}$ | VI | 73.0816±0.022 | 73.1364±0.022 | 0.4426±0.016 | 0.185±0.023 | 0.1265±0.015 | 0.8109±0.022 | 0.857±0.035 | 0.9983±0.011 |
| MedNLI (US) | $+\mathcal{D}$ | DNN | 77.172±0.041 | 77.2386±0.039 | 0.3783±0.05 | 0.1579±0.009 | 0.107±0.007 | 0.7736±0.039 | 0.857±0.015 | 0.9952±0.008 |
| MedNLI (US) | $+\mathcal{D}$ | DC | 79.9945±0.037 | 80.0047±0.037 | 0.3375±0.045 | 0.1392±0.005 | 0.0956±0.002 | 0.7486±0.041 | 0.8872±0.011 | 0.9924±0.011 |
| MedNLI (US) | $+\mathcal{D}$ | MCD | 80.1022±0.014 | 80.1688±0.014 | 0.3453±0.02 | 0.1565±0.009 | 0.1065±0.005 | 0.7437±0.012 | 0.8654±0.004 | 0.9872±0.001 |
| MedNLI (US) | $+\mathcal{D}$ | VI | 77.0617±0.043 | 77.1027±0.042 | 0.351±0.046 | 0.1041±0.019 | 0.0773±0.01 | 0.7851±0.046 | 0.9293±0.025 | 1.0101±0.015 |
| SMOKING (US) | $-\mathcal{D}$ | DNN | 27.1141±0.041 | 45.8333±0.142 | 0.7724±0.054 | 0.2961±0.057 | 0.154±0.012 | 1.4298±0.106 | 0.7724±0.163 | 1.5536±0.028 |
| SMOKING (US) | $-\mathcal{D}$ | DC | 25.7924±0.041 | 46.7949±0.039 | 0.6407±0.035 | 0.1625±0.043 | 0.1215±0.016 | 1.4331±0.035 | 0.9455±0.031 | 1.5791±0.01 |
| SMOKING (US) | $-\mathcal{D}$ | MCD | 26.707±0.058 | 45.8333±0.073 | 0.7609±0.077 | 0.2771±0.048 | 0.1507±0.021 | 1.4519±0.045 | 0.8942±0.058 | 1.5651±0.003 |
| SMOKING (US) | $-\mathcal{D}$ | VI | 23.4485±0.034 | 32.0513±0.043 | 0.7197±0.053 | 0.2171±0.021 | 0.15±0.023 | 1.5031±0.038 | 0.8974±0.113 | 1.5887±0.004 |
| SMOKING (US) | $+\mathcal{D}$ | DNN | 24.9822±0.041 | 51.6026±0.071 | 0.6764±0.076 | 0.2262±0.013 | 0.1334±0.031 | 1.3928±0.068 | 0.6571±0.114 | 1.5596±0.011 |
| SMOKING (US) | $+\mathcal{D}$ | DC | 27.0293±0.033 | 47.1154±0.075 | 0.841±0.043 | 0.3441±0.053 | 0.1738±0.007 | 1.4297±0.06 | 0.7276±0.118 | 1.5419±0.02 |
| SMOKING (US) | $+\mathcal{D}$ | MCD | 25.0029±0.051 | 40.3846±0.058 | 0.6777±0.022 | 0.206±0.019 | 0.1401±0.014 | 1.482±0.014 | 0.9487±0.04 | 1.5895±0.003 |
| SMOKING (US) | $+\mathcal{D}$ | VI | 26.1167±0.03 | 50.3205±0.094 | 0.765±0.175 | 0.3201±0.094 | 0.1584±0.045 | 1.3857±0.094 | 0.75±0.063 | 1.5397±0.003 |
| PxSLU (FR) | $-\mathcal{D}$ | DNN | 32.2541±0.075 | 88.2452±0.012 | 0.5743±0.077 | 0.4556±0.094 | 0.2955±0.014 | 1.2807±0.05 | 0.995±0.004 | 1.3821±0.003 |
| PxSLU (FR) | $-\mathcal{D}$ | DC | 34.1464±0.026 | 84.2989±0.05 | 0.4599±0.088 | 0.3936±0.047 | 0.2354±0.03 | 1.2154±0.062 | 1.0±0.0 | 1.3768±0.007 |
| PxSLU (FR) | $-\mathcal{D}$ | MCD | 33.211±0.067 | 88.6902±0.018 | 0.5232±0.103 | 0.4852±0.079 | 0.2615±0.027 | 1.2571±0.062 | 1.0±0.0 | 1.3806±0.004 |
| PxSLU (FR) | $-\mathcal{D}$ | VI | 25.9883±0.041 | 88.9169±0.013 | 0.5393±0.021 | 0.5014±0.026 | 0.2552±0.007 | 1.2666±0.014 | 1.0±0.0 | 1.3814±0.001 |
| PxSLU (FR) | $+\mathcal{D}$ | DNN | 33.1131±0.097 | 80.1763±0.238 | 0.5389±0.116 | 0.3929±0.057 | 0.2867±0.037 | 1.2548±0.06 | 0.9831±0.018 | 1.38±0.003 |
| PxSLU (FR) | $+\mathcal{D}$ | DC | 40.3372±0.07 | 89.1184±0.039 | 0.2649±0.127 | 0.2576±0.105 | 0.1568±0.058 | 1.0539±0.111 | 0.9997±0.001 | 1.3496±0.021 |
| PxSLU (FR) | $+\mathcal{D}$ | MCD | 34.1571±0.029 | 89.1436±0.026 | 0.5403±0.043 | 0.5074±0.015 | 0.2663±0.013 | 1.2694±0.026 | 1.0±0.0 | 1.3821±0.002 |
| PxSLU (FR) | $+\mathcal{D}$ | VI | 41.8279±0.073 | 91.0999±0.015 | 0.1634±0.051 | 0.1403±0.064 | 0.0861±0.029 | 0.9464±0.066 | 0.9958±0.004 | 1.3246±0.019 |
| MEDMCQA (FR) | $-\mathcal{D}$ | DNN | 28.5727±0.03 | 63.88±0.055 | 0.6787±0.1 | 0.3256±0.043 | 0.1575±0.021 | 1.5347±0.062 | 0.9625±0.033 | 1.6063±0.003 |
| MEDMCQA (FR) | $-\mathcal{D}$ | DC | 32.0291±0.003 | 63.5584±0.007 | 0.4822±0.015 | 0.165±0.01 | 0.1099±0.0 | 1.3846±0.009 | 0.9764±0.007 | 1.5888±0.001 |
| MEDMCQA (FR) | $-\mathcal{D}$ | MCD | 28.3648±0.029 | 61.3612±0.103 | 0.7533±0.044 | 0.3819±0.084 | 0.1518±0.02 | 1.5848±0.024 | 1.0±0.0 | 1.6091±0.0 |
| MEDMCQA (FR) | $-\mathcal{D}$ | VI | 23.1977±0.042 | 48.5531±0.046 | 0.7499±0.023 | 0.242±0.033 | 0.1329±0.004 | 1.5822±0.013 | 1.0±0.0 | 1.6089±0.0 |
| MEDMCQA (FR) | $+\mathcal{D}$ | DNN | 28.1549±0.045 | 61.0932±0.089 | 0.6859±0.12 | 0.3026±0.009 | 0.1582±0.01 | 1.5388±0.077 | 0.9775±0.02 | 1.6064±0.004 |
| MEDMCQA (FR) | $+\mathcal{D}$ | DC | 29.7558±0.07 | 60.343±0.103 | 0.6687±0.17 | 0.2973±0.069 | 0.1278±0.018 | 1.5216±0.122 | 0.9893±0.019 | 1.6025±0.012 |
| MEDMCQA (FR) | $+\mathcal{D}$ | MCD | 31.0912±0.016 | 68.4352±0.033 | 0.5541±0.115 | 0.3122±0.059 | 0.1477±0.031 | 1.4543±0.081 | 0.9936±0.011 | 1.5999±0.007 |
| MEDMCQA (FR) | $+\mathcal{D}$ | VI | 23.1243±0.035 | 49.8553±0.031 | 0.7415±0.017 | 0.2336±0.026 | 0.1222±0.008 | 1.5765±0.01 | 1.0±0.0 | 1.6085±0.0 |
| MORFITT (FR) | $-\mathcal{D}$ | DNN | 49.7506±0.009 | 59.038±0.012 | 0.6499±0.022 | 0.2323±0.021 | 0.0398±0.005 | 2.0748±0.015 | 0.796±0.045 | 2.4454±0.003 |
| MORFITT (FR) | $-\mathcal{D}$ | DC | 55.4551±0.01 | 62.5306±0.008 | 0.6134±0.003 | 0.2243±0.003 | 0.0425±0.001 | 2.0332±0.006 | 0.8775±0.014 | 2.4411±0.001 |
| MORFITT (FR) | $-\mathcal{D}$ | MCD | 48.3269±0.008 | 57.3529±0.008 | 0.6309±0.021 | 0.1519±0.05 | 0.0464±0.007 | 2.2692±0.03 | 0.9856±0.006 | 2.4767±0.003 |
| MORFITT (FR) | $-\mathcal{D}$ | VI | 53.0834±0.014 | 61.6728±0.01 | 0.6408±0.042 | 0.2571±0.039 | 0.0477±0.006 | 2.0245±0.007 | 0.7724±0.047 | 2.4369±0.004 |
| MORFITT (FR) | $+\mathcal{D}$ | DNN | 53.4963±0.019 | 61.8015±0.014 | 0.6081±0.017 | 0.2098±0.014 | 0.0363±0.002 | 2.0538±0.015 | 0.8334±0.01 | 2.4453±0.002 |
| MORFITT (FR) | $+\mathcal{D}$ | DC | 56.4418±0.018 | 62.9596±0.02 | 0.6148±0.027 | 0.2325±0.018 | 0.0433±0.003 | 2.0251±0.015 | 0.8667±0.03 | 2.4394±0.001 |
| MORFITT (FR) | $+\mathcal{D}$ | MCD | 51.8519±0.015 | 60.5392±0.006 | 0.5718±0.003 | 0.0687±0.022 | 0.0298±0.0 | 2.1426±0.01 | 0.9651±0.005 | 2.4629±0.002 |
| MORFITT (FR) | $+\mathcal{D}$ | VI | 54.2993±0.011 | 62.7145±0.01 | 0.5346±0.008 | 0.0488±0.018 | 0.0279±0.002 | 2.1064±0.014 | 0.9752±0.007 | 2.4602±0.002 |

Table 5: Comparison for text classification performance and uncertainty-awareness. We report the mean of 10 seed runs for all the metrics. We denote best score with bold and second best with underline. We denote both English and French domain-specific PLMs with $+\mathcal{D}$. The models DC, MCD, VI are from the $+\mathcal{U}$ set.

Appendix B Full results

We present the detailed results for all configurations in Table 5. As noted in the main text, the most obvious trend across the board is that scores are tightly coupled with datasets: the range of scores achieved by all classifiers we study tends to be fairly limited within a given dataset, whereas we often observe spectacular differences from one dataset to the next.

Insofar as classification metrics go, we observe that $+\mathcal{D}$ models almost always occupy the top ranks. This is especially salient in MedABS and MedNLI, where all $+\mathcal{D}$ classifiers outperform all $-\mathcal{D}$ classifiers both in terms of F1 and accuracy. In PxSLU, the only model that deviates from this trend is the $+\mathcal{D}-\mathcal{U}$ model, which appears to suffer from an especially low accuracy. In the two other French datasets, along with SMOKING, classification metrics do not exhibit as clear a division between domain-specific and general PLMs.

As for calibration metrics, we find a very similar behavior to what we highlight in the main text: uncertainty-unaware models almost never rank among the top two contenders. Rankings per metric tend to be fairly stable as long as we control for domain-specificity.

Lastly, looking at the various Bayesian architectures, we can see that DropConnect is not necessarily the best system across all uncertainty-aware classifiers. Selecting the best architectures given 3 seeds, and then expanding to 10 seeds, most likely led to some degree of sampling bias, which would explain this discrepancy. It does however constitute a strong contender across many situations: it remains the best-ranking Bayesian architecture on average, both in terms of F1 on the validation set and in terms of test BS, ECE, SCE, NLL and Entropy.

In fact, differences in terms of ranks across datasets per architecture are not always significant: if we normalize all 80 classifiers per dataset by taking their rank, then a Kruskal-Wallis H-test suggests that F1, accuracy and ECE do not lead to significant rank differences across architectures (assuming a threshold of $p < 0.05$). Likewise, comparing $+\mathcal{D}$ and $-\mathcal{D}$ models with the same procedure does not lead to significant differences in terms of ECE, SCE, and coverage.
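To make the test concrete, the sketch below computes the Kruskal-Wallis H statistic from per-group values, with one group per architecture. It is a simplified illustration: it uses average ranks for ties but omits the tie correction factor, the toy inputs are made up, and the exact grouping used in the study is an assumption on our part.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        i = j
    weighted = sum(r * r / len(values) for r, values in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)


h_separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
h_identical = kruskal_wallis_h([[1, 2], [1, 2]])
```

Under the null hypothesis, H approximately follows a chi-squared distribution with one fewer degree of freedom than the number of groups. In practice, a library routine such as `scipy.stats.kruskal` returns both the statistic and the p-value and applies the tie correction.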