Showing 1–2 of 2 results for author: McKinney, L

Searching in archive cs.
  1. arXiv:2303.08112  [pdf, other]

    cs.LG

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Authors: Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt

    Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique…

    Submitted 26 November, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

  2. arXiv:2301.03652  [pdf, other]

    cs.LG cs.AI

    On The Fragility of Learned Reward Functions

    Authors: Lev McKinney, Yawen Duan, David Krueger, Adam Gleave

    Abstract: Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of tr…

    Submitted 9 January, 2023; originally announced January 2023.

    Comments: 5 pages, 2 figures, presented at the NeurIPS Deep RL and ML Safety Workshops