Showing 1–12 of 12 results for author: Kusa, W

  1. arXiv:2406.08080  [pdf, other]

    cs.CL cs.AI cs.CY

    AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection

    Authors: Pia Pachinger, Janis Goldzycher, Anna Maria Planitzer, Wojciech Kusa, Allan Hanbury, Julia Neidhardt

    Abstract: Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to Findings of the Association for Computational Linguistics: ACL 2024

    ACM Class: I.2.7

  2. arXiv:2404.00399  [pdf, other]

    cs.CL cs.AI cs.LG

    Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

    Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, et al. (20 additional authors not shown)

    Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, where…

    Submitted 23 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Preprint

  3. arXiv:2311.12474  [pdf, other]

    cs.CL cs.IR

    CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews

    Authors: Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, Allan Hanbury

    Abstract: Systematic literature reviews (SLRs) play an essential role in summarising, synthesising and validating scientific evidence. In recent years, there has been a growing interest in using machine learning techniques to automate the identification of relevant studies for SLRs. However, the lack of standardised evaluation datasets makes comparing the performance of such automated literature screening s…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted at NeurIPS 2023 Datasets and Benchmarks Track

  4. arXiv:2309.01684  [pdf, other]

    cs.IR cs.CL cs.DL

    CRUISE-Screening: Living Literature Reviews Toolbox

    Authors: Wojciech Kusa, Petr Knoth, Allan Hanbury

    Abstract: Keeping up with research and finding related work is still a time-consuming task for academics. Researchers sift through thousands of studies to identify a few relevant ones. Automation techniques can help by increasing the efficiency and effectiveness of this task. To this end, we developed CRUISE-Screening, a web-based application for conducting living literature reviews - a type of literature r…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: Paper accepted at CIKM 2023. The arXiv version has an extra section about limitations in the Appendix that is not present in the ACM version

  5. arXiv:2307.00381  [pdf, other]

    cs.IR cs.CL

    Effective Matching of Patients to Clinical Trials using Entity Extraction and Neural Re-ranking

    Authors: Wojciech Kusa, Óscar E. Mendoza, Petr Knoth, Gabriella Pasi, Allan Hanbury

    Abstract: Clinical trials (CTs) often fail due to inadequate patient recruitment. This paper tackles the challenges of CT retrieval by presenting an approach that addresses the patient-to-trials paradigm. Our approach involves two key components in a pipeline-based model: (i) a data enrichment technique for enhancing both queries and documents during the first retrieval stage, and (ii) a novel re-ranking sc…

    Submitted 1 July, 2023; originally announced July 2023.

    Comments: Under review

  6. Outcome-based Evaluation of Systematic Review Automation

    Authors: Wojciech Kusa, Guido Zuccon, Petr Knoth, Allan Hanbury

    Abstract: Current methods of evaluating search strategies and automated citation screening for systematic literature reviews typically rely on counting the number of relevant and not relevant publications. This established practice, however, does not accurately reflect the reality of conducting a systematic review, because not all included publications have the same influence on the final outcome of the sys…

    Submitted 30 June, 2023; originally announced June 2023.

    Comments: Accepted at ICTIR 2023

  7. arXiv:2304.08188  [pdf, ps, other]

    cs.IR

    Statute-enhanced lexical retrieval of court cases for COLIEE 2022

    Authors: Tobias Fink, Gabor Recski, Wojciech Kusa, Allan Hanbury

    Abstract: We discuss our experiments for COLIEE Task 1, a court case retrieval competition using cases from the Federal Court of Canada. During experiments on the training data we observe that passage level retrieval with rank fusion outperforms document level retrieval. By explicitly adding extracted statute information to the queries and documents we can further improve the results. We submit two passage…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: Sixteenth International Workshop on Juris-informatics (JURISIN). 2022

  8. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  9. arXiv:2206.15076  [pdf, other]

    cs.CL

    BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing

    Authors: Jason Alan Fries, Leon Weber, Natasha Seelam, Gabriel Altay, Debajyoti Datta, Samuele Garda, Myungsun Kang, Ruisi Su, Wojciech Kusa, Samuel Cahyawijaya, Fabio Barth, Simon Ott, Matthias Samwald, Stephen Bach, Stella Biderman, Mario Sänger, Bo Wang, Alison Callahan, Daniel León Periñán, Théo Gigant, Patrick Haller, Jenny Chim, Jose David Posada, John Michael Giorgi, Karthik Rangasai Sivaraman, et al. (18 additional authors not shown)

    Abstract: Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful i…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: Submitted to NeurIPS 2022 Datasets and Benchmarks Track

  10. ORCAS-I: Queries Annotated with Intent using Weak Supervision

    Authors: Daria Alexander, Wojciech Kusa, Arjen P. de Vries

    Abstract: User intent classification is an important task in information retrieval. In this work, we introduce a revised taxonomy of user intent. We take the widely used differentiation between navigational, transactional and informational queries as a starting point, and identify three different sub-classes for the informational queries: instrumental, factual and abstain. The resulting classification of us…

    Submitted 27 September, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

    Comments: Presented at SIGIR 2022 (resource track)

  11. arXiv:2201.07534  [pdf, other]

    cs.IR

    Automation of Citation Screening for Systematic Literature Reviews using Neural Networks: A Replicability Study

    Authors: Wojciech Kusa, Allan Hanbury, Petr Knoth

    Abstract: In the process of Systematic Literature Review, citation screening is estimated to be one of the most time-consuming steps. Multiple approaches to automate it using various machine learning techniques have been proposed. The first research papers that apply deep neural networks to this problem were published in the last two years. In this work, we conduct a replicability study of the first two dee…

    Submitted 19 January, 2022; originally announced January 2022.

    Comments: Accepted at ECIR 2022

  12. arXiv:1707.02063  [pdf, other]

    cs.CL

    External Evaluation of Event Extraction Classifiers for Automatic Pathway Curation: An extended study of the mTOR pathway

    Authors: Wojciech Kusa, Michael Spranger

    Abstract: This paper evaluates the impact of various event extraction systems on automatic pathway curation using the popular mTOR pathway. We quantify the impact of training data sets as well as different machine learning classifiers and show that some improve the quality of automatically extracted pathways.

    Submitted 7 July, 2017; originally announced July 2017.