Search | arXiv e-print repository

arXiv:2407.13511 [pdf, other]

Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks

Authors: Samy Ateia, Udo Kruschwitz

Abstract: Commercial large language models (LLMs), like OpenAI's GPT-4 powering ChatGPT and Anthropic's Claude 3 Opus, have dominated natural language processing (NLP) benchmarks across different domains. New competing Open-Source alternatives like Mixtral 8x7B or Llama 3 have emerged and seem to be closing the gap while often offering higher throughput and being less costly to use. Open-Source LLMs can also be self-hosted, which makes them interesting for enterprise and clinical use cases where sensitive data should not be processed by third parties. We participated in the 12th BioASQ challenge, which is a retrieval augmented generation (RAG) setting, and explored the performance of current GPT models Claude 3 Opus, GPT-3.5-turbo and Mixtral 8x7b with in-context learning (zero-shot, few-shot) and QLoRa fine-tuning. We also explored how additional relevant knowledge from Wikipedia added to the context-window of the LLM might improve their performance. Mixtral 8x7b was competitive in the 10-shot setting, both with and without fine-tuning, but failed to produce usable results in the zero-shot setting. QLoRa fine-tuning and Wikipedia context did not lead to measurable performance gains. Our results indicate that the performance gap between commercial and open-source models in RAG setups exists mainly in the zero-shot setting and can be closed by simply collecting few-shot examples for domain-specific use cases. The code needed to rerun these experiments is available through GitHub.

Submitted 18 July, 2024; originally announced July 2024.

Comments: Version as accepted at the BioASQ Lab at CLEF 2024

arXiv:2404.08259 [pdf, ps, other]

Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study

Authors: Wan-Hua Her, Udo Kruschwitz

Abstract: Machine Translation has made impressive progress in recent years offering close to human-level performance on many languages, but studies have primarily focused on high-resource languages with broad online presence and resources. With the help of growing Large Language Models, more and more low-resource languages achieve better results through the presence of other languages. However, studies have shown that not all low-resource languages can benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions of low-resource languages such as data scarcity and parameter sensitivity and focus on refined solutions that combat low-resource difficulties and creative solutions such as harnessing language similarity. Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate noisiness in the data and present our approach to carry out text preprocessing extensively. Evaluation was conducted using combined metrics: BLEU, chrF and TER. Statistical significance results with Bonferroni correction show surprisingly high baseline systems, and that Back-translation leads to significant improvement. Furthermore, we present a qualitative analysis of translation errors and system limitations.

Submitted 12 April, 2024; originally announced April 2024.

Comments: Preprint accepted at the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages (SIGUL 2024)

arXiv:2402.18179 [pdf, other]

Challenges in Pre-Training Graph Neural Networks for Context-Based Fake News Detection: An Evaluation of Current Strategies and Resource Limitations

Authors: Gregor Donabauer, Udo Kruschwitz

Abstract: Pre-training of neural networks has recently revolutionized the field of Natural Language Processing (NLP) and has before demonstrated its effectiveness in computer vision. At the same time, advances around the detection of fake news were mainly driven by the context-based paradigm, where different types of signals (e.g. from social media) form graph-like structures that hold contextual information apart from the news article to classify. We propose to merge these two developments by applying pre-training of Graph Neural Networks (GNNs) in the domain of context-based fake news detection. Our experiments provide an evaluation of different pre-training strategies for graph-based misinformation detection and demonstrate that transfer learning does currently not lead to significant improvements over training a model from scratch in the domain. We argue that a major current issue is the lack of suitable large-scale resources that can be used for pre-training.

Submitted 28 February, 2024; originally announced February 2024.

Comments: Preprint accepted at LREC-COLING 2024

arXiv:2306.16108 [pdf, other]

Is ChatGPT a Biomedical Expert? -- Exploring the Zero-Shot Performance of Current GPT Models in Biomedical Tasks

Authors: Samy Ateia, Udo Kruschwitz

Abstract: We assessed the performance of commercial Large Language Models (LLMs) GPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b Phase B, which is focused on answer generation, both models demonstrated competitive abilities with leading systems. Remarkably, they achieved this with simple zero-shot learning, grounded with relevant snippets. Even without relevant snippets, their performance was decent, though not on par with the best systems. Interestingly, the older and cheaper GPT-3.5-Turbo system was able to compete with GPT-4 in the grounded Q&A setting on factoid and list answers. In Task 11b Phase A, focusing on retrieval, query expansion through zero-shot learning improved performance, but the models fell short compared to other systems. The code needed to rerun these experiments is available through GitHub.

Submitted 24 July, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: Preprint accepted at the 11th BioASQ Workshop at the 14th Conference and Labs of the Evaluation Forum (CLEF) 2023; Changes: 1. Added related work and experimental setup sections. 2. Reworked discussion and future work section. 3. Fixed multiple typos and improved style. Changed license

arXiv:2212.06560 [pdf, ps, other]

Exploring Fake News Detection with Heterogeneous Social Media Context Graphs

Authors: Gregor Donabauer, Udo Kruschwitz

Abstract: Fake news detection has become a research area that goes way beyond a purely academic interest as it has direct implications on our society as a whole. Recent advances have primarily focused on text-based approaches. However, it has become clear that to be effective one needs to incorporate additional, contextual information such as spreading behaviour of news articles and user interaction patterns on social media. We propose to construct heterogeneous social context graphs around news articles and reformulate the problem as a graph classification task. Exploring the incorporation of different types of information (to get an idea as to what level of social context is most effective) and using different graph neural network architectures indicates that this approach is highly effective with robust results on a common benchmark dataset.

Submitted 13 December, 2022; originally announced December 2022.

Comments: Preprint accepted at the 45th European Conference on Information Retrieval (ECIR 2023)

arXiv:2210.05581 [pdf, other]

Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts

Authors: Juntao Yu, Silviu Paun, Maris Camilleri, Paloma Carretero Garcia, Jon Chamberlain, Udo Kruschwitz, Massimo Poesio

Abstract: Although several datasets annotated for anaphoric reference/coreference exist, even the largest such datasets have limitations in terms of size, range of domains, coverage of anaphoric phenomena, and size of documents included. Yet, the approaches proposed to scale up anaphoric annotation haven't so far resulted in datasets overcoming these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference due in part to substantial activity by the players, in part thanks to the use of a new resolve-and-aggregate paradigm to 'complete' markable annotations through the combination of an anaphoric resolver and an aggregation method for anaphoric reference. The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no comparable size datasets exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents (> 2K in length).

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2204.02712 [pdf, other]

A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts

Authors: Miriam Schirmer, Udo Kruschwitz, Gregor Donabauer

Abstract: Recent progress in natural language processing has been impressive in many different areas with transformer-based approaches setting new benchmarks for a wide range of applications. This development has also lowered the barriers for people outside the NLP community to tap into the tools and resources applied to a variety of domain-specific applications. The bottleneck however still remains the lack of annotated gold-standard collections as soon as one's research or professional interest falls outside the scope of what is readily available. One such area is genocide-related research (also including the work of experts who have a professional interest in accessing, exploring and searching large-scale document collections on the topic, such as lawyers). We present GTC (Genocide Transcript Corpus), the first annotated corpus of genocide-related court transcripts which serves three purposes: (1) to provide a first reference corpus for the community, (2) to establish benchmark performances (using state-of-the-art transformer-based approaches) for the new classification task of paragraph identification of violence-related witness statements, (3) to explore first steps towards transfer learning within the domain. We consider our contribution to be addressing in particular this year's hot topic on Language Technology for All.

Submitted 6 April, 2022; originally announced April 2022.

Comments: Preprint. Accepted to appear in Proceedings of LREC 2022

arXiv:2204.01841 [pdf, other]

Applying Automatic Text Summarization for Fake News Detection

Authors: Philipp Hartl, Udo Kruschwitz

Abstract: The distribution of fake news is not a new but a rapidly growing problem. The shift to news consumption via social media has been one of the drivers for the spread of misleading and deliberately wrong information, as, in addition to its ease of use, there is rarely any veracity monitoring. Due to the harmful effects of such fake news on society, the detection of these has become increasingly important. We present an approach to the problem that combines the power of transformer-based language models while simultaneously addressing one of their inherent problems. Our framework, CMTR-BERT, combines multiple text representations, with the goal of circumventing sequential limits and related loss of information the underlying transformer architecture typically suffers from. Additionally, it enables the incorporation of contextual information. Extensive experiments on two very different, publicly available datasets demonstrate that our approach is able to set new state-of-the-art performance benchmarks. Apart from the benefit of using automatic text summarization techniques we also find that the incorporation of contextual information contributes to performance gains.

Submitted 4 April, 2022; originally announced April 2022.

Comments: Preprint. Accepted to appear in Proceedings of LREC 2022

arXiv:2110.02042 [pdf, other]

doi 10.48415/2021/fhw5-x128

ur-iw-hnt at GermEval 2021: An Ensembling Strategy with Multiple BERT Models

Authors: Hoai Nam Tran, Udo Kruschwitz

Abstract: This paper describes our approach (ur-iw-hnt) for the Shared Task of GermEval2021 to identify toxic, engaging, and fact-claiming comments. We submitted three runs using an ensembling strategy by majority (hard) voting with multiple different BERT models of three different types: German-based, Twitter-based, and multilingual models. All ensemble models outperform single models, while BERTweet is the winner of all individual models in every subtask. Twitter-based models perform better than GermanBERT models, and multilingual models perform worse but by a small margin.

Submitted 5 October, 2021; originally announced October 2021.

Comments: 5 pages, 1 figure

Journal ref: In Proceedings of the GermEval 2021 Workshop on the Identification of Toxic, Engaging, and Fact-Claiming Comments: 17th Conference on Natural Language Processing KONVENS 2021, pages 83-87, Online (2021)

arXiv:2106.13528 [pdf]

Interactive query expansion for professional search applications

Authors: Tony Russell-Rose, Philip Gooch, Udo Kruschwitz

Abstract: Knowledge workers (such as healthcare information professionals, patent agents and recruitment professionals) undertake work tasks where search forms a core part of their duties. In these instances, the search task is often complex and time-consuming and requires specialist expert knowledge to formulate accurate search strategies. Interactive features such as query expansion can play a key role in supporting these tasks. However, generating query suggestions within a professional search context requires that consideration be given to the specialist, structured nature of the search strategies they employ. In this paper, we investigate a variety of query expansion methods applied to a collection of Boolean search strategies used in a variety of real-world professional search tasks. The results demonstrate the utility of context-free distributional language models and the value of using linguistic cues such as ngram order to optimise the balance between precision and recall.

Submitted 25 June, 2021; originally announced June 2021.

Comments: 34 pages, 5 figures

arXiv:2102.04211 [pdf, other]

Challenging Social Media Threats using Collective Well-being Aware Recommendation Algorithms and an Educational Virtual Companion

Authors: Dimitri Ognibene, Davide Taibi, Udo Kruschwitz, Rodrigo Souza Wilkens, Davinia Hernandez-Leo, Emily Theophilou, Lidia Scifo, Rene Alejandro Lobo, Francesco Lomonaco, Sabrina Eimler, H. Ulrich Hoppe, Nils Malzahn

Abstract: Social media have become an integral part of our lives, expanding our interlinking capabilities to new levels. There is plenty to be said about their positive effects. On the other hand, however, some serious negative implications of social media have been repeatedly highlighted in recent years, pointing at various threats to society and its more vulnerable members, such as teenagers. We thus propose a theoretical framework based on an adaptive "Social Media Virtual Companion" for educating and supporting an entire community, teenage students, to interact in social media environments in order to achieve desirable conditions, defined in terms of a community-specific and participatory designed measure of Collective Well-Being (CWB). This Companion combines automatic processing with expert intervention and guidance. The virtual Companion will be powered by a Recommender System (CWB-RS) that will optimize a CWB metric instead of engagement or platform profit, which currently largely drives recommender systems thereby disregarding any societal collateral effect. We put an emphasis on experts and educators in the educationally managed social media community of the Companion.
They play five key roles: (a) use the Companion in classroom-based educational activities; (b) guide the definition of the CWB; (c) provide a hierarchical structure of learning strategies, objectives and activities that will support and contain the adaptive sequencing algorithms of the CWB-RS based on hierarchical reinforcement learning; (d) act as moderators of direct conflicts between the members of the community; and, finally, (e) monitor and address ethical and educational issues that are beyond the intelligent agent's competence and control. Preliminary results on the performance of the Companion's components and studies of the educational and psychological underlying principles are presented.

Submitted 17 October, 2022; v1 submitted 25 January, 2021; originally announced February 2021.

arXiv:1905.04577 [pdf, other]

Information search in a professional context - exploring a collection of professional search tasks

Authors: Suzan Verberne, Jiyin He, Gineke Wiggers, Tony Russell-Rose, Udo Kruschwitz, Arjen P. de Vries

Abstract: Search conducted in a work context is an everyday activity that has been around since long before the Web was invented, yet we still seem to understand little about its general characteristics. With this paper we aim to contribute to a better understanding of this large but rather multi-faceted area of 'professional search'. Unlike task-based studies that aim at measuring the effectiveness of search methods, we chose to take a step back by conducting a survey among professional searchers to understand their typical search tasks. By doing so we offer complementary insights into the subject area. We asked our respondents to provide actual search tasks they have worked on, information about how these were conducted and details on how successful they eventually were. We then manually coded the collection of 56 search tasks with task characteristics and relevance criteria, and used the coded dataset for exploration purposes. Despite the relatively small scale of this study, our data provides enough evidence that professional search is indeed very different from Web search in many key respects and that this is a field that offers many avenues for future research.

Submitted 11 May, 2019; originally announced May 2019.

Comments: 5 pages, 2 figures

arXiv:1701.02050 [pdf, other]

doi 10.1145/3020165.3022129

Personalised Query Suggestion for Intranet Search with Temporal User Profiling

Authors: Thanh Vu, Alistair Willis, Udo Kruschwitz, Dawei Song

Abstract: Recent research has shown the usefulness of using collective user interaction data (e.g., query logs) to recommend query modification suggestions for Intranet search. However, most of the query suggestion approaches for Intranet search follow a "one size fits all" strategy, whereby different users who submit an identical query would get the same query suggestion list. This is problematic, as even with the same query, different users may have different topics of interest, which may change over time in response to the user's interaction with the system. We address the problem by proposing a personalised query suggestion framework for Intranet search. For each search session, we construct two temporal user profiles: a click user profile using the user's clicked documents and a query user profile using the user's submitted queries. We then use the two profiles to re-rank the non-personalised query suggestion list returned by a state-of-the-art query suggestion method for Intranet search. Experimental results on a large-scale query logs collection show that our personalised framework significantly improves the quality of suggested queries.

Submitted 8 January, 2017; originally announced January 2017.

Comments: 4 pages, 2 figures, the 2017 ACM SIGIR Conference on Human Information Interaction & Retrieval (CHIIR)

arXiv:1204.4071 [pdf, other]

Motivations for Participation in Socially Networked Collective Intelligence Systems

Authors: Jon Chamberlain, Udo Kruschwitz, Massimo Poesio

Abstract: One of the most significant challenges facing systems of collective intelligence is how to encourage participation on the scale required to produce high quality data. This paper details ongoing work with Phrase Detectives, an online game-with-a-purpose deployed on Facebook, and investigates user motivations for participation in social network gaming where the wisdom of crowds produces useful data.

Submitted 18 April, 2012; originally announced April 2012.

Comments: Presented at Collective Intelligence conference, 2012 (arXiv:1204.2991)

Report number: CollectiveIntelligence/2012/50

Showing 1–14 of 14 results for author: Kruschwitz, U