Showing 1–5 of 5 results for author: Faysse, M

  1. arXiv:2407.01449

    cs.IR cs.CL cs.CV

    ColPali: Efficient Document Retrieval with Vision Language Models

    Authors: Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo

    Abstract: Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark c…

    Submitted 2 July, 2024; v1 submitted 27 June, 2024; originally announced July 2024.

    Comments: Under Review

  2. arXiv:2402.12997

    cs.IR cs.CL

    Towards Trustworthy Reranking: A Simple yet Effective Abstention Mechanism

    Authors: Hippolyte Gisserot-Boukhlef, Manuel Faysse, Emmanuel Malherbe, Céline Hudelot, Pierre Colombo

    Abstract: Neural Information Retrieval (NIR) has significantly improved upon heuristic-based IR systems. Yet, failures remain frequent, the models used often being unable to retrieve documents relevant to the user's query. We address this challenge by proposing a lightweight abstention mechanism tailored for real-world constraints, with particular emphasis placed on the reranking phase. We introduce a proto…

    Submitted 2 April, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

  3. arXiv:2402.09363

    cs.CL cs.CR

    Copyright Traps for Large Language Models

    Authors: Matthieu Meeus, Igor Shilov, Manuel Faysse, Yves-Alexandre de Montjoye

    Abstract: Questions of fair use of copyright-protected content to train Large Language Models (LLMs) are being actively debated. Document-level inference has been proposed as a new task: inferring from black-box access to the trained model whether a piece of content has been seen during training. SOTA methods however rely on naturally occurring memorization of (part of) the content. While very effective aga…

    Submitted 4 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: 41st International Conference on Machine Learning (ICML 2024)

  4. arXiv:2402.00786

    cs.CL cs.LG

    CroissantLLM: A Truly Bilingual French-English Language Model

    Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F. T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo

    Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a cust…

    Submitted 29 March, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  5. Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications

    Authors: Manuel Faysse, Gautier Viaud, Céline Hudelot, Pierre Colombo

    Abstract: Instruction Fine-Tuning (IFT) is a powerful paradigm that strengthens the zero-shot capabilities of Large Language Models (LLMs), but in doing so induces new evaluation metric requirements. We show LLM-based metrics to be well adapted to these requirements, and leverage them to conduct an investigation of task-specialization strategies, quantifying the trade-offs that emerge in practical industria…

    Submitted 21 October, 2023; originally announced October 2023.

    Comments: Short paper accepted at EMNLP 2023

    Journal ref: 2023.emnlp-main.559