Refusal in Language Models Is Mediated by a Single Direction

tags
  • lit
link
https://proceedings.neurips.cc/paper_files/paper/2024/hash/f545448535dfde4f9786555403ab7c49-Abstract-Conference.html
zotero
zotero://select/library/items/ZMBCXP7Q
itemType
conferencePaper
authors
  • Andy Arditi
  • Oscar Obeso
  • Aaquib Syed
  • Daniel Paleka
  • Nina Panickssery
  • Wes Gurnee
  • Neel Nanda
pubDate
2024-12-16
retDate
2025-10-30
relatedProjects
DIY AI

Backlinks

  • DIY AI Related Work
