
Showing 1–3 of 3 results for author: O'Gara, A

Searching in archive cs.
  1. arXiv:2310.19852  [pdf, other]

    cs.AI

    AI Alignment: A Comprehensive Survey

    Authors: Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O'Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, Wen Gao

    Abstract: AI alignment aims to make AI systems behave in line with human intentions and values. As AI systems grow more capable, so do risks from misalignment. To provide a comprehensive and up-to-date overview of the alignment field, in this survey, we delve into the core concepts, methodology, and practice of alignment. First, we identify four principles as the key objectives of AI alignment: Robustness,…

    Submitted 1 May, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Continually updated, including weak-to-strong generalization and socio-technical thinking. 58 pages (excluding bibliography), 801 references

  2. arXiv:2308.14752  [pdf, other]

    cs.CY cs.AI cs.HC

    AI Deception: A Survey of Examples, Risks, and Potential Solutions

    Authors: Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks

    Abstract: This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth. We first survey empirical examples of AI deception, discussing both special-use AI systems (including Meta's CICERO) built for specific competitive situations, and general-purpose AI systems (…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: 18 pages (not including executive summary, references, and appendix), six figures

  3. arXiv:2308.01404  [pdf, other]

    cs.CL cs.CY cs.LG

    Hoodwinked: Deception and Cooperation in a Text-Based Game for Language Models

    Authors: Aidan O'Gara

    Abstract: Are current language models capable of deception and lie detection? We study this question by introducing a text-based game called $\textit{Hoodwinked}$, inspired by Mafia and Among Us. Players are locked in a house and must find a key to escape, but one player is tasked with killing the others. Each time a murder is committed, the surviving players have a natural language discussion then vote to…

    Submitted 3 August, 2023; v1 submitted 5 July, 2023; originally announced August 2023.

    Comments: Added reference for McKenzie 2023; updated acknowledgements