
Showing 1–4 of 4 results for author: Valentine, D

Searching in archive cs.
  1. arXiv:2407.15211

    cs.CL cs.AI cs.CR cs.CV cs.LG

    When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

    Authors: Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez

    Abstract: The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transfer…

    Submitted 21 July, 2024; originally announced July 2024.

  2. arXiv:2402.06782

    cs.AI cs.CL

    Debating with More Persuasive LLMs Leads to More Truthful Answers

    Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

    Abstract: Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this…

    Submitted 25 July, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: For code please check: https://github.com/ucl-dark/llm_debate

  3. arXiv:2312.02566

    cs.LG cs.AI

    Structured World Representations in Maze-Solving Transformers

    Authors: Michael Igorevich Ivanitskiy, Alex F. Spies, Tilman Räuker, Guillaume Corlouer, Chris Mathwin, Lucia Quirke, Can Rager, Rusheb Shah, Dan Valentine, Cecilia Diniz Behn, Katsumi Inoue, Samy Wu Fung

    Abstract: Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of their inner workings remains a significant challenge. To this end, we set out to understand small transformer models in a more tractable setting: that…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: 15 pages, 18 figures, 15 tables. Corresponding author: Michael Ivanitskiy. Code available at https://github.com/understanding-search/structured-representations-maze-transformers

  4. arXiv:2309.10498

    cs.LG cs.AI cs.SE

    A Configurable Library for Generating and Manipulating Maze Datasets

    Authors: Michael Igorevich Ivanitskiy, Rusheb Shah, Alex F. Spies, Tilman Räuker, Dan Valentine, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, Cecilia Diniz Behn, Samy Wu Fung

    Abstract: Understanding how machine learning models respond to distributional shifts is a key research challenge. Mazes serve as an excellent testbed due to varied generation algorithms offering a nuanced platform to simulate both subtle and pronounced distributional shifts. To enable systematic investigations of model behavior on out-of-distribution data, we present $\texttt{maze-dataset}$, a comprehensive…

    Submitted 24 October, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

    Comments: 9 pages, 5 figures, 1 table. Corresponding author: Michael Ivanitskiy. Code available at https://github.com/understanding-search/maze-dataset