Woohyeuk Lee
Folder: research/zotero
This folder contains 32 items, all dated Oct 29, 2024:

A Roadmap to Pluralistic Alignment
Ablation Programming for Machine Learning
Artificial Intelligence, Values, and Alignment
Constitutional AI - Harmlessness from AI Feedback
Direct Preference Optimization - Your Language Model is Secretly a Reward Model
ELIZA — a computer program for the study of natural language communication between man and machine
Generative AI Misuse - A Taxonomy of Tactics and Insights from Real-World Data
Harms from Increasingly Agentic Algorithmic Systems
How Culture Shapes What People Want From AI
Language Models Learn to Mislead Humans via RLHF
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Measuring the Algorithmic Efficiency of Neural Networks
MemGPT - Towards LLMs as Operating Systems
On the Dangers of Stochastic Parrots - Can Language Models Be Too Big? 🦜
Red Teaming Language Models to Reduce Harms - Methods, Scaling Behaviors, and Lessons Learned
Refusal in Language Models Is Mediated by a Single Direction
Removing RLHF Protections in GPT-4 via Fine-Tuning
Scaling Laws for Neural Language Models
Shadow Alignment - The Ease of Subverting Safely-Aligned Language Models
The Canceling of the American Mind - Cancel Culture Undermines Trust and Threatens Us All—But There Is a Solution
The Llama 3 Herd of Models
Towards Bidirectional Human-AI Alignment - A Systematic Review for Clarifications, Framework, and Future Directions
Training language models to follow instructions with human feedback
Universal and Transferable Adversarial Attacks on Aligned Language Models
chandrasekharan2018
chandrasekharan2019
jiang2023
koshy2023
li2022
santurkar
weld2022a
‘Hi Chatbot, let’s Talk about Politics!’ Examining the Impact of Verbal Anthropomorphism in Conversational Agent Voting Advice Applications (CAVAAs) on Higher and Lower Politically Sophisticated Users