Showing 1–1 of 1 results for author: Hardy, A F

Searching in archive cs.
  1. arXiv:2407.09447  [pdf, other]

    cs.CL

    ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts

    Authors: Amelia F. Hardy, Houjun Liu, Bernard Lange, Mykel J. Kochenderfer

    Abstract: Typical schemes for automated red-teaming large language models (LLMs) focus on discovering prompts that trigger a frozen language model (the defender) to generate toxic text. This often results in the prompting model (the adversary) producing text that is unintelligible and unlikely to arise. Here, we propose a reinforcement learning formulation of the LLM red-teaming task which allows us to disc…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: 9 pages, 2 tables, 2 figures