doi 10.1145/3539618.3591887

MMEAD: MS MARCO Entity Annotations and Disambiguations

Authors: Chris Kamphuis, Aileen Lin, Siwen Yang, Jimmy Lin, Arjen P. de Vries, Faegheh Hasibi

Abstract: MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets. We specify a format to store and share links for both document and passage collections of MS MARCO. Following this specification, we release entity links to Wikipedia for documents and passages in both MS MARCO collections (v1 and v2). Entity links have been produced by the REL and BLINK systems. MMEAD is an easy-to-install Python package, allowing users to load the link data and entity embeddings effortlessly. Using MMEAD takes only a few lines of code. Finally, we show how MMEAD can be used for IR research that uses entity information. We show how to improve recall@1000 and MRR@10 on more complex queries on the MS MARCO v1 passage dataset by using this resource. We also demonstrate how entity expansions can be used for interactive search applications.

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2010.12674

Exploring task-based query expansion at the TREC-COVID track

Authors: Thomas Schoegje, Chris Kamphuis, Koen Dercksen, Djoerd Hiemstra, Toine Pieters, Arjen de Vries

Abstract: We explore how to generate effective queries based on search tasks. Our approach has three main steps: 1) identify search tasks based on research goals, 2) manually classify search queries according to those tasks, and 3) compare three methods to improve search rankings based on the task context. The most promising approach is based on expanding the user's query terms using task terms, which slightly improved the NDCG@20 scores over a BM25 baseline. Further improvements might be gained if we can identify more specific search tasks.

Submitted 16 November, 2020; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: Update version 2: improved title. Update version 3: corrected terminology hyponym -> hypernym in two instances. Documents our participation in the TREC-COVID track. Contains 16 pages, 0 figures.

arXiv:2003.08276

Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

Authors: Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen de Vries

Abstract: There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.

Submitted 18 March, 2020; originally announced March 2020.

Showing 1–3 of 3 results for author: Kamphuis, C