-
Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers
Authors:
Mirelle Bueno,
Eduardo Seiti de Oliveira,
Rodrigo Nogueira,
Roberto A. Lotufo,
Jayr Alencar Pereira
Abstract:
Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be frequented by real users than randomly scraped ones, which makes the corpus more representative and relevant. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts are available at https://github.com/unicamp-dl/quati.
Submitted 10 April, 2024;
originally announced April 2024.
-
Lissard: Long and Simple Sequential Reasoning Datasets
Authors:
Mirelle Bueno,
Roberto Lotufo,
Rodrigo Nogueira
Abstract:
Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists have 80 items. In this paper, we introduce Lissard, a benchmark comprising seven tasks whose goal is to assess the ability of models to process and generate a wide range of sequence lengths, requiring repetitive procedural execution. Our evaluation of open-source (Mistral-7B and Mixtral-8x7B) and proprietary models (GPT-3.5 and GPT-4) shows a consistent decline in performance across all models as the complexity of the sequence increases. The datasets and code are available at https://github.com/unicamp-dl/Lissard
Submitted 20 February, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
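The "common items in two lists" example from the abstract above can be illustrated with a small generator. This is only a sketch under stated assumptions: the function name `make_common_items_example`, its parameters, and the use of integers as list items are hypothetical and are not taken from the Lissard repository.

```python
import random


def make_common_items_example(list_len, n_common, seed=0):
    """Build two lists of `list_len` distinct integers that share exactly
    `n_common` items. Hypothetical helper, not the benchmark's own generator.
    Assumes 0 <= n_common <= list_len and list_len >= 1."""
    rng = random.Random(seed)
    # Draw enough distinct values for the shared part plus two disjoint remainders.
    pool = rng.sample(range(10 * list_len), 2 * list_len - n_common)
    common = pool[:n_common]
    only_a = pool[n_common:list_len]
    only_b = pool[list_len:]
    list_a = common + only_a
    list_b = common + only_b
    # Shuffle so the shared items are not clustered at the front.
    rng.shuffle(list_a)
    rng.shuffle(list_b)
    return list_a, list_b, sorted(common)


if __name__ == "__main__":
    # The abstract contrasts 20-item lists with 80-item lists.
    for length in (20, 80):
        a, b, gold = make_common_items_example(length, n_common=5)
        print(length, gold)
```

Varying `list_len` while keeping the rule fixed gives inputs of increasing length for the same simple procedure, which is the kind of length scaling the abstract describes.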
-
Induced Natural Language Rationales and Interleaved Markup Tokens Enable Extrapolation in Large Language Models
Authors:
Mirelle Bueno,
Carlos Gemmell,
Jeffrey Dalton,
Roberto Lotufo,
Rodrigo Nogueira
Abstract:
The ability to extrapolate, i.e., to make predictions on sequences that are longer than those presented as training examples, is a challenging problem for current deep learning models. Recent work shows that this limitation persists in state-of-the-art Transformer-based models. Most solutions to this problem use specific architectures or training methods that do not generalize to other tasks. We demonstrate that large language models can succeed in extrapolation without modifying their architecture or training procedure. Our experimental results show that generating step-by-step rationales and introducing marker tokens are both required for effective extrapolation. First, we induce a language model to produce step-by-step rationales before outputting the answer to effectively communicate the task to the model. However, as sequences become longer, we find that current models struggle to keep track of token positions. To address this issue, we interleave output tokens with markup tokens that act as explicit positional and counting symbols. Our findings show how these two complementary approaches enable remarkable sequence extrapolation and highlight a limitation of current architectures: they do not generalize effectively without explicit surface form guidance. Code is available at https://github.com/MirelleB/induced-rationales-markup-tokens
Submitted 28 November, 2022; v1 submitted 24 August, 2022;
originally announced August 2022.
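The idea of interleaving output tokens with markup tokens that act as positional symbols can be sketched as follows. The marker format `[i]` and both function names are assumptions made for illustration; the abstract does not specify the markup actually used.

```python
def interleave_markers(tokens, marker_format="[{i}]"):
    """Return the tokens with a 1-based position marker placed before each one.
    The marker format is a hypothetical choice, not the paper's."""
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(marker_format.format(i=i))
        out.append(tok)
    return out


def strip_markers(marked):
    """Recover the original tokens: markers sit at even indices, tokens at odd ones."""
    return marked[1::2]


if __name__ == "__main__":
    marked = interleave_markers(["a", "b", "c"])
    print(" ".join(marked))
    print(strip_markers(marked))
```

With explicit markers in the surface form, the position of each output token can be read off directly instead of being tracked implicitly.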
-
AMLB: an AutoML Benchmark
Authors:
Pieter Gijsbers,
Marcos L. P. Bueno,
Stefan Coors,
Erin LeDell,
Sébastien Poirier,
Janek Thomas,
Bernd Bischl,
Joaquin Vanschoren
Abstract:
Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results.
Submitted 16 November, 2023; v1 submitted 25 July, 2022;
originally announced July 2022.
-
Issue Auto-Assignment in Software Projects with Machine Learning Techniques
Authors:
Pedro Oliveira,
Rossana M. C. Andrade,
Tales P. Nogueira,
Isaac Barreto,
Leandro Morais Bueno
Abstract:
Usually, managers or technical leaders in software projects assign issues manually. This task becomes more complex as issue descriptions become more detailed. This complexity can also make the process more time-consuming and more prone to errors (misassignments). In the literature, many studies aim to address this problem by using machine learning strategies. Although there is no specific solution that works for all companies, experience reports are useful to guide the choices in industrial auto-assignment projects. This paper presents an industrial initiative conducted in a global electronics company that aims to minimize the time spent and the errors that can arise in the issue assignment process. As our main contributions, we present a literature review, an industrial report comparing different algorithms, and lessons learned during the project.
Submitted 4 April, 2021;
originally announced April 2021.
-
3-Colorable Delaunay Triangulations
Authors:
Lucas Moutinho Bueno
Abstract:
We propose an algorithm to create a 3-colorable Delaunay triangulation. The input of the problem we are trying to solve is a set X of n two-dimensional points. The output is a 3-colorable two-dimensional Delaunay triangulation T for X ∪ Y, where Y is a set of m new points. We want m to be as small as possible.
Submitted 24 December, 2018;
originally announced December 2018.
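To make the notion of a 3-colorable triangulation concrete, the sketch below tests whether the vertices of a given triangulation can be properly colored with three colors. It is a brute-force backtracking check written for illustration only; it is not the algorithm proposed in the paper, it does not add points, and the function name `three_coloring` and the input format (a list of vertex-index triples) are assumptions.

```python
def three_coloring(triangles):
    """Return a dict vertex -> color in {0, 1, 2} such that adjacent vertices
    differ, or None if no such coloring exists. Exponential in the worst case."""
    # Build the adjacency sets of the triangulation's edge graph.
    adj = {}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (a, c)):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    order = sorted(adj)
    colors = {}

    def assign(k):
        # Try to color vertices order[k:], given the colors chosen so far.
        if k == len(order):
            return True
        v = order[k]
        for col in range(3):
            if all(colors.get(w) != col for w in adj[v]):
                colors[v] = col
                if assign(k + 1):
                    return True
                del colors[v]
        return False

    return dict(colors) if assign(0) else None


if __name__ == "__main__":
    # Two triangles sharing an edge: a proper 3-coloring exists.
    print(three_coloring([(0, 1, 2), (1, 2, 3)]))
    # Three triangles around an interior vertex of degree 3: the edge graph
    # is the complete graph on four vertices, so no 3-coloring exists.
    print(three_coloring([(0, 1, 3), (1, 2, 3), (0, 2, 3)]))
```

The second input shows why extra points can be needed: an interior vertex of odd degree blocks any proper 3-coloring of the triangulation.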
-
Bayesian approach for near-duplicate image detection
Authors:
Lucas Moutinho Bueno,
Eduardo Valle,
Ricardo da Silva Torres
Abstract:
In this paper we propose a Bayesian approach for near-duplicate image detection, and investigate how different probabilistic models affect the performance obtained. The task of identifying an image whose metadata are missing arises in many applications: metadata retrieval in cultural institutions, detection of copyright violations, investigation of latent cross-links in archives and libraries, duplicate elimination in storage management, etc. The majority of current solutions are based either on voting algorithms, which are very precise but expensive, or on visual dictionaries, which are efficient but less precise. Our approach uses local descriptors in a novel way which, through a careful application of decision theory, allows very fine control of the trade-off between precision and efficiency. In addition, the method attains a great compromise between those two axes, with more than 99% accuracy with fewer than 10 database operations.
Submitted 25 April, 2011;
originally announced April 2011.