Refusal in Language Models Is Mediated by a Single Direction
tags: lit
link: https://proceedings.neurips.cc/paper_files/paper/2024/hash/f545448535dfde4f9786555403ab7c49-Abstract-Conference.html
zotero: zotero://select/library/items/ZMBCXP7Q
itemType: journalArticle
authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
pubDate: 2024-12-16
retDate: 2025-10-30
relatedProjects: DIY AI
Backlinks: DIY AI Related Work