Showing 1–35 of 35 results for author: Jatowt, A

  1. arXiv:2407.02395  [pdf, other]

    cs.SE cs.CL

    Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

    Authors: Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, Yi Cai

    Abstract: Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, there remains a gap…

    Submitted 4 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2310.16263

  2. arXiv:2406.04866  [pdf, other]

    cs.CL

    ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering

    Authors: Raphael Gruber, Abdelrahman Abdallah, Michael Färber, Adam Jatowt

    Abstract: We introduce ComplexTempQA, a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks like HOTPOTQA, TORQUE, and TEQUILA in scale and scope. Utilizing data from Wikipedia and Wikidata, the dataset covers questions spanning over two decades and offers an unmatc…

    Submitted 7 June, 2024; originally announced June 2024.

  3. arXiv:2406.04493  [pdf, other]

    cs.CV cs.CL

    CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

    Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt

    Abstract: In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and informati…

    Submitted 6 June, 2024; originally announced June 2024.

  4. arXiv:2406.01863  [pdf, other]

    cs.CL

    Towards Effective Time-Aware Language Representation: Exploring Enhanced Temporal Understanding in Language Models

    Authors: Jiexin Wang, Adam Jatowt, Yi Cai

    Abstract: In the evolving field of Natural Language Processing, understanding the temporal context of text is increasingly crucial. This study investigates methods to incorporate temporal information during pre-training, aiming to achieve effective time-aware language representation for improved performance on time-related tasks. In contrast to common pre-trained models like BERT, which rely on synchronic d…

    Submitted 3 June, 2024; originally announced June 2024.

  5. arXiv:2404.04728  [pdf, other]

    cs.CL cs.HC

    Navigating the Landscape of Hint Generation Research: From the Past to the Future

    Authors: Anubhav Jangra, Jamshid Mozafari, Adam Jatowt, Smaranda Muresan

    Abstract: Digital education has gained popularity in the last decade, especially after the COVID-19 pandemic. With the improving capabilities of large language models to reason and communicate with users, envisioning intelligent tutoring systems (ITSs) that can facilitate self-learning is not very far-fetched. One integral component to fulfill this vision is the ability to give accurate and effective feedba…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: Submitted to TACL'24

  6. TriviaHG: A Dataset for Automatic Hint Generation from Factoid Questions

    Authors: Jamshid Mozafari, Anubhav Jangra, Adam Jatowt

    Abstract: Nowadays, individuals tend to engage in dialogues with Large Language Models, seeking answers to their questions. In times when such answers are readily accessible to anyone, the stimulation and preservation of human's cognitive abilities, as well as the assurance of maintaining good reasoning skills by humans becomes crucial. This study addresses such needs by proposing hints (instead of final an…

    Submitted 10 May, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted at SIGIR 2024

    Journal ref: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

  7. ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

    Authors: Bhawna Piryani, Jamshid Mozafari, Adam Jatowt

    Abstract: Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchron…

    Submitted 10 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted at SIGIR 2024

    Journal ref: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

  8. arXiv:2403.17848  [pdf, other]

    cs.CL cs.IR

    ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

    Authors: Abdelrahman Abdallah, Mahmoud Kasem, Mahmoud Abdalla, Mohamed Mahmoud, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt

    Abstract: In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted at SIGIR 2024

  9. arXiv:2403.04080  [pdf, other]

    cs.CL cs.CV

    Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis

    Authors: Abdelrahman Abdallah, Daniel Eberharter, Zoe Pfister, Adam Jatowt

    Abstract: This paper presents a comprehensive survey of research works on the topic of form understanding in the context of scanned documents. We delve into recent advancements and breakthroughs in the field, highlighting the significance of language models and transformers in solving this challenging task. Our research methodology involves an in-depth analysis of popular documents and forms of understandin…

    Submitted 6 March, 2024; originally announced March 2024.

  10. arXiv:2401.12078  [pdf, other]

    cs.CL

    Temporal Blind Spots in Large Language Models

    Authors: Jonas Wallat, Adam Jatowt, Avishek Anand

    Abstract: Large language models (LLMs) have recently gained significant attention due to their unparalleled ability to perform various natural language processing tasks. These models, benefiting from their advanced natural language understanding capabilities, have demonstrated impressive zero-shot performance. However, the pre-training data utilized in LLMs is often confined to a specific corpus, resulting…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: accepted at WSDM'24

  11. arXiv:2401.00779  [pdf, other]

    cs.CL cs.AI

    Temporal Validity Change Prediction

    Authors: Georg Wenzel, Adam Jatowt

    Abstract: Temporal validity is an important property of text that is useful for many downstream applications, such as recommender systems, conversational AI, or story understanding. Existing benchmarking tasks often require models to identify the temporal validity duration of a single statement. However, in many cases, additional contextual information, such as sentences in a story or posts on a social medi…

    Submitted 1 January, 2024; originally announced January 2024.

    Comments: 9 pages, 9 figures, 3 tables

    MSC Class: 68T50 ACM Class: I.2.7

  12. arXiv:2310.16263  [pdf, other]

    cs.SE cs.AI cs.CL cs.CR

    Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation

    Authors: Jiexin Wang, Liuwen Cao, Xitong Luo, Zhiping Zhou, Jiayuan Xie, Adam Jatowt, Yi Cai

    Abstract: Large language models (LLMs) have brought significant advancements to code generation, benefiting both novice and experienced developers. However, their training using unsanitized data from open-source repositories, like GitHub, introduces the risk of inadvertently propagating security vulnerabilities. To effectively mitigate this concern, this paper presents a comprehensive study focused on evalu…

    Submitted 24 October, 2023; originally announced October 2023.

  13. arXiv:2309.09800  [pdf, other]

    cs.CL

    AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification

    Authors: Abdelrahman Abdallah, Mahmoud Abdalla, Mohamed Elkasaby, Yasser Elbendary, Adam Jatowt

    Abstract: The extraction of key information from receipts is a complex task that involves the recognition and extraction of text from scanned receipts. This process is crucial as it enables the retrieval of essential content and organizing it into structured documents for easy access and analysis. In this paper, we present AMuRD, a novel multilingual human-annotated dataset specifically designed for informa…

    Submitted 26 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

  14. arXiv:2308.03531  [pdf, other]

    cs.CL cs.AI

    Measuring Variety, Balance, and Disparity: An Analysis of Media Coverage of the 2021 German Federal Election

    Authors: Michael Färber, Jannik Schwade, Adam Jatowt

    Abstract: Determining and measuring diversity in news articles is important for a number of reasons, including preventing filter bubbles and fueling public discourse, especially before elections. So far, the identification and analysis of diversity have been illuminated in a variety of ways, such as measuring the overlap of words or topics between news articles related to US elections. However, the question…

    Submitted 7 August, 2023; originally announced August 2023.

  15. arXiv:2308.00002  [pdf, ps, other]

    cs.AI cs.CL cs.LG

    An Overview Of Temporal Commonsense Reasoning and Acquisition

    Authors: Georg Wenzel, Adam Jatowt

    Abstract: Temporal commonsense reasoning refers to the ability to understand the typical temporal context of phrases, actions, and events, and use it to reason over problems requiring such knowledge. This trait is essential in temporal natural language processing tasks, with possible applications such as timeline summarization, temporal question answering, and temporal natural language inference. Recent res…

    Submitted 16 November, 2023; v1 submitted 27 July, 2023; originally announced August 2023.

    Comments: 27 pages, 7 figures, 6 tables

    MSC Class: 68T50 ACM Class: I.2.7

  16. arXiv:2307.11278  [pdf, other]

    cs.CL

    Generator-Retriever-Generator Approach for Open-Domain Question Answering

    Authors: Abdelrahman Abdallah, Adam Jatowt

    Abstract: Open-domain question answering (QA) tasks usually require the retrieval of relevant information from a large corpus to generate accurate answers. We propose a novel approach called Generator-Retriever-Generator (GRG) that combines document retrieval techniques with a large language model (LLM), by first prompting the model to generate contextual documents based on a given question. In parallel, a…

    Submitted 26 March, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

  17. Exploring the State of the Art in Legal QA Systems

    Authors: Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt

    Abstract: Answering questions related to the legal domain is a complex task, primarily due to the intricate nature and diverse range of legal document systems. Providing an accurate answer to a legal query typically necessitates specialized knowledge in the relevant domain, which makes this task all the more challenging, even for human experts. Question answering (QA) systems are designed to generate answer…

    Submitted 15 September, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Journal ref: J Big Data 10, 127 (2023)

  18. arXiv:2301.13479  [pdf, other]

    cs.CL

    Archive TimeLine Summarization (ATLS): Conceptual Framework for Timeline Generation over Historical Document Collections

    Authors: Nicolas Gutehrlé, Antoine Doucet, Adam Jatowt

    Abstract: Archive collections are nowadays mostly available through search engines interfaces, which allow a user to retrieve documents by issuing queries. The study of these collections may be, however, impaired by some aspects of search engines, such as the overwhelming number of documents returned or the lack of contextual knowledge provided. New methods that could work independently or in combination wi…

    Submitted 31 January, 2023; originally announced January 2023.

    Journal ref: Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Oct 2022, Gyeongju, Republic of Korea. pp.13-23

  19. arXiv:2212.11765  [pdf, other]

    q-fin.GN cs.IR cs.LG

    Predicting Companies' ESG Ratings from News Articles Using Multivariate Timeseries Analysis

    Authors: Tanja Aue, Adam Jatowt, Michael Färber

    Abstract: Environmental, social and governance (ESG) engagement of companies moved into the focus of public attention over recent years. With the requirements of compulsory reporting being implemented and investors incorporating sustainability in their investment decisions, the demand for transparent and reliable ESG ratings is increasing. However, automatic approaches for forecasting ESG ratings have been…

    Submitted 13 November, 2022; originally announced December 2022.

  20. arXiv:2212.01669  [pdf, other]

    cs.CL

    A Survey on Medical Document Summarization

    Authors: Raghav Jain, Anubhav Jangra, Sriparna Saha, Adam Jatowt

    Abstract: The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. As there is so much data accessible to doctors and patients alike, summarizing it has become increasingly necessary - this has been s…

    Submitted 3 December, 2022; originally announced December 2022.

  21. arXiv:2204.13032  [pdf, other]

    cs.CL

    BiTimeBERT: Extending Pre-Trained Language Representations with Bi-Temporal Information

    Authors: Jiexin Wang, Adam Jatowt, Masatoshi Yoshikawa, Yi Cai

    Abstract: Time is an important aspect of documents and is used in a range of NLP and IR tasks. In this work, we investigate methods for incorporating temporal information during pre-training to further improve the performance on time-related tasks. Compared with common pre-trained language models like BERT which utilize synchronic document collections (e.g., BookCorpus and Wikipedia) as the training corpora…

    Submitted 27 April, 2023; v1 submitted 27 April, 2022; originally announced April 2022.

  22. arXiv:2204.09140  [pdf, other]

    cs.CL cs.AI cs.IR

    Multi-hop Question Answering

    Authors: Vaibhav Mavi, Anubhav Jangra, Adam Jatowt

    Abstract: The task of Question Answering (QA) has attracted significant research interest for long. Its relevance to language understanding and knowledge retrieval tasks, along with the simple setting makes the task of QA crucial for strong AI systems. Recent success on simple QA tasks has shifted the focus to more complex settings. Among these, Multi-Hop QA (MHQA) is one of the most researched tasks over t…

    Submitted 31 May, 2024; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: Published at Foundations and Trends in Information Retrieval

  23. arXiv:2112.03634  [pdf, ps, other]

    cs.DL cs.CL

    Change Summarization of Diachronic Scholarly Paper Collections by Semantic Evolution Analysis

    Authors: Naman Paharia, Muhammad Syafiq Mohd Pozi, Adam Jatowt

    Abstract: The amount of scholarly data has been increasing dramatically over the last years. For newcomers to a particular science domain (e.g., IR, physics, NLP) it is often difficult to spot larger trends and to position the latest research in the context of prior scientific achievements and breakthroughs. Similarly, researchers in the history of science are interested in tools that allow them to analyze…

    Submitted 7 December, 2021; originally announced December 2021.

    Comments: 4 pages, JCDL-2021

  24. arXiv:2109.05199  [pdf, other]

    cs.CL cs.MM cs.NE

    A Survey on Multi-modal Summarization

    Authors: Anubhav Jangra, Sourajit Mukherjee, Adam Jatowt, Sriparna Saha, Mohammad Hasanuzzaman

    Abstract: The new era of technology has brought us to the point where it is convenient for people to share their opinions over an abundance of platforms. These platforms have a provision for the users to express themselves in multiple forms of representations, including text, images, videos, and audio. This, however, makes it difficult for users to obtain all the key information about a topic, making the ta…

    Submitted 13 February, 2023; v1 submitted 11 September, 2021; originally announced September 2021.

    Comments: Accepted in ACM CSUR 2023

  25. arXiv:2109.03438  [pdf, other]

    cs.CL cs.AI

    ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Historical News Collections

    Authors: Jiexin Wang, Adam Jatowt, Masatoshi Yoshikawa

    Abstract: In the last few years, open-domain question answering (ODQA) has advanced rapidly due to the development of deep learning techniques and the availability of large-scale QA datasets. However, the current datasets are essentially designed for synchronic document collections (e.g., Wikipedia). Temporal news collections such as long-term news archives spanning several decades, are rarely used in train…

    Submitted 21 February, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

  26. A Neural Conversation Generation Model via Equivalent Shared Memory Investigation

    Authors: Changzhen Ji, Yating Zhang, Xiaozhong Liu, Adam Jatowt, Changlong Sun, Conghui Zhu, Tiejun Zhao

    Abstract: Conversation generation as a challenging task in Natural Language Generation (NLG) has been increasingly attracting attention over the last years. A number of recent works adopted sequence-to-sequence structures along with external knowledge, which successfully enhanced the quality of generated conversations. Nevertheless, few works utilized the knowledge extracted from similar conversations for u…

    Submitted 20 August, 2021; originally announced August 2021.

  27. arXiv:2108.08297  [pdf, other]

    cs.AI

    Fact-Tree Reasoning for N-ary Question Answering over Knowledge Graphs

    Authors: Yao Zhang, Peiyao Li, Hongru Liang, Adam Jatowt, Zhenglu Yang

    Abstract: In the question answering (QA) task, multi-hop reasoning framework has been extensively studied in recent years to perform more efficient and interpretable answer reasoning on the Knowledge Graph (KG). However, multi-hop reasoning is inapplicable for answering n-ary fact questions due to its linear reasoning nature. We discover that there are two feasible improvements: 1) upgrade the basic reasoning…

    Submitted 13 March, 2022; v1 submitted 17 August, 2021; originally announced August 2021.

    Comments: ACL 2022 (Findings)

  28. arXiv:2012.11957  [pdf, other]

    cs.AI

    Generalized Relation Learning with Semantic Correlation Awareness for Link Prediction

    Authors: Yao Zhang, Xu Zhang, Jun Wang, Hongru Liang, Wenqiang Lei, Zhe Sun, Adam Jatowt, Zhenglu Yang

    Abstract: Developing link prediction models to automatically complete knowledge graphs has recently been the focus of significant research interest. The current methods for the link prediction task have two natural problems: 1) the relation distributions in KGs are usually unbalanced, and 2) there are many unseen relations that occur in practical situations. These two problems limit the training effectiveness and…

    Submitted 18 April, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

    Comments: Preprint of accepted AAAI2021 paper

  29. arXiv:2010.07620  [pdf, other]

    cs.AI cs.CL

    GMH: A General Multi-hop Reasoning Model for KG Completion

    Authors: Yao Zhang, Hongru Liang, Adam Jatowt, Wenqiang Lei, Xin Wei, Ning Jiang, Zhenglu Yang

    Abstract: Knowledge graphs are essential for numerous downstream natural language processing applications, but are typically incomplete with many facts missing. This results in research efforts on multi-hop reasoning task, which can be formulated as a search process and current models typically perform short distance reasoning. However, the long-distance reasoning is also vital with the ability to connect t…

    Submitted 2 September, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

    Comments: Accepted to EMNLP 2021; 10 pages, 5 figures, 4 tables

  30. arXiv:2005.09252  [pdf, other]

    cs.IR

    Multi-Modal Summary Generation using Multi-Objective Optimization

    Authors: Anubhav Jangra, Sriparna Saha, Adam Jatowt, Mohammad Hasanuzzaman

    Abstract: Significant development of communication technology over the past few years has motivated research in multi-modal summarization techniques. A majority of the previous works on multi-modal summarization focus on text and images. In this paper, we propose a novel extractive multi-objective optimization based model to produce a multi-modal summary containing text, images, and videos. Important object…

    Submitted 19 May, 2020; originally announced May 2020.

    Comments: 5 pages, 2 figures

  31. arXiv:2005.06748  [pdf, other]

    cs.IR cs.CY

    ECIR 2020 Workshops: Assessing the Impact of Going Online

    Authors: Sérgio Nunes, Suzanne Little, Sumit Bhatia, Ludovico Boratto, Guillaume Cabanac, Ricardo Campos, Francisco M. Couto, Stefano Faralli, Ingo Frommholz, Adam Jatowt, Alípio Jorge, Mirko Marras, Philipp Mayr, Giovanni Stilo

    Abstract: ECIR 2020 https://ecir2020.org/ was one of the many conferences affected by the COVID-19 pandemic. The Conference Chairs decided to keep the initially planned dates (April 14-17, 2020) and move to a fully online event. In this report, we describe the experience of organizing the ECIR 2020 Workshops in this scenario from two perspectives: the workshop organizers and the workshop participants. We pr…

    Submitted 14 May, 2020; originally announced May 2020.

    Comments: 10 pages, 3 figures, submitted to ACM SIGIR Forum

  32. arXiv:2003.08615  [pdf]

    cs.LG cs.AI stat.ML

    Joint Event Extraction along Shortest Dependency Paths using Graph Convolutional Networks

    Authors: Ali Balali, Masoud Asadpour, Ricardo Campos, Adam Jatowt

    Abstract: Event extraction (EE) is one of the core information extraction tasks, whose purpose is to automatically identify and extract information about incidents and their actors from texts. This may be beneficial to several domains such as knowledge bases, question answering, information retrieval and summarization tasks, to name a few. The problem of extracting event information from texts is longstandi…

    Submitted 19 March, 2020; originally announced March 2020.

    Journal ref: Knowledge-Based Systems, Volume 210, Year 2020, Page 106492

  33. Citation Recommendation: Approaches and Datasets

    Authors: Michael Färber, Adam Jatowt

    Abstract: Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data se…

    Submitted 14 May, 2020; v1 submitted 17 February, 2020; originally announced February 2020.

    Comments: to be published in the International Journal on Digital Libraries

  34. arXiv:1811.06278  [pdf, other]

    cs.CL

    Survey of Computational Approaches to Lexical Semantic Change

    Authors: Nina Tahmasebi, Lars Borin, Adam Jatowt

    Abstract: Our languages are in constant flux driven by external factors such as cultural, societal and technological changes, as well as by only partially understood internal motivations. Words acquire new meanings and lose old senses, new words are coined or borrowed from other languages and obsolete words slide into obscurity. Understanding the characteristics of shifts in the meaning and in the use of wo…

    Submitted 13 March, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: This survey is an extended version of Survey of Computational Approaches to Diachronic Conceptual Change

  35. arXiv:1707.00510  [pdf, other]

    cs.DL

    Towards Understanding the Evolution of the WWW Conference

    Authors: Pavel Savov, Adam Jatowt, Radoslaw Nielek

    Abstract: The World Wide Web conference is a well-established and mature venue with an already long history. Over the years it has been attracting papers reporting many important research achievements centered around the Web. In this work we aim at understanding the evolution of WWW conference series by detecting crucial years and important topics. We propose a simple yet novel approach based on tracking th…

    Submitted 3 July, 2017; originally announced July 2017.