Primary Dataset and Data Source
The primary dataset will be the LMSYS Chatbot Arena Conversations dataset, which contains 33,000 cleaned, multi-turn conversations collected from approximately 13,000 unique IP addresses between April and June 2023.
This data was sourced from the Chatbot Arena, a crowdsourcing platform where users are prompted to chat with two anonymous large language models (LLMs) and then vote for which response they prefer.
Ethical Considerations and Preprocessing
User consent was obtained via the platform’s terms of use. All personally identifiable information (PII) has been removed from the dataset, and no demographic data is available.
The dataset already includes flags from the OpenAI moderation API for inappropriate conversations, which will serve as an initial filter for locating refusal instances. To further refine this process, a specialized refusal classifier (e.g., NousResearch/Minos-v1, trained on ~400K examples) will be used to systematically identify a comprehensive set of refusal examples from the 33,000 conversations.
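As a rough illustration of how these two signals could be combined, the sketch below wraps the refusal classifier and the moderation flags in small helpers. The chat-style input format, the "Refusal" label name, and the "openai_moderation" field layout are assumptions; verify them against the model card and the dataset card before relying on them.

```python
# Hedged sketch: helpers for refusal identification.
# Assumptions: Minos-v1 loads as a standard text-classification pipeline,
# expects a chat-formatted string, and emits a "Refusal"-style label;
# moderation results are stored per turn under "openai_moderation".
from transformers import pipeline

refusal_clf = pipeline("text-classification", model="NousResearch/Minos-v1")

def is_refusal(user_msg: str, assistant_msg: str, threshold: float = 0.5) -> bool:
    """Score one prompt/response pair; input format assumed from the model card."""
    text = f"<|user|>\n{user_msg}\n<|assistant|>\n{assistant_msg}"
    pred = refusal_clf(text, truncation=True)[0]
    return pred["label"].lower().startswith("refusal") and pred["score"] >= threshold

def flagged_by_moderation(record: dict) -> bool:
    """Initial filter using the dataset's OpenAI moderation flags (assumed schema)."""
    results = record.get("openai_moderation") or []
    return any(r.get("flagged", False) for r in results)
```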
Analysis Procedure
The analysis will proceed in three stages:
- Refusal Identification: Use the aforementioned classifier and OpenAI moderation flags to isolate all instances of model refusals within the dataset.
- Refusal Style Classification: Employ or develop a second classifier to categorize each identified refusal into distinct styles (e.g., blunt, polite, moralizing, redirecting); one possible approach is sketched after this list.
- User Response Analysis: For each classified refusal, the user’s immediate subsequent prompt (the next conversational turn) will be analyzed to categorize their reaction (e.g., task abandonment, simple re-prompting, adversarial response).
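For the style-classification stage, one lightweight option (shown below as an illustration, not the committed method) is an off-the-shelf zero-shot classifier scored against the candidate style labels; the label phrasings here are assumptions and would need refinement against the codebook.

```python
# Hedged sketch: zero-shot refusal-style classification.
# The model choice and the candidate label wording are assumptions.
from transformers import pipeline

STYLE_LABELS = ["blunt refusal", "polite refusal",
                "moralizing refusal", "redirecting refusal"]

style_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classify_refusal_style(refusal_text: str) -> str:
    """Return the highest-scoring style label for a refusal message."""
    result = style_clf(refusal_text, candidate_labels=STYLE_LABELS)
    return result["labels"][0]
```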
Contingency Datasets
As a fallback or for comparative analysis, structured benchmarks designed to elicit refusals will be explored. These include RefusalBench, SORRY-Bench, and HarmBench.
Data Filtering Results
The LMSYS Chatbot Arena Conversations dataset originally contained 1M user interactions, each consisting of either a single human instruction-machine response pair or a multi-turn conversation. I first removed non-English data (222,547 removed, 777,453 remaining), since the refusal classifier was trained primarily to detect refusals in English. At this stage, I faced constraints on the computational resources available to me, so I cut the dataset down drastically to 100,000 user interactions and streamed those through the refusal classifier to remove all non-refusal data (85,841 removed, 14,159 remaining). From the resulting 14,159 user interactions, I drew a further random subset of 1,000 for developing the codebook.
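The filtering steps above could be implemented roughly as follows. This is a minimal sketch: the Hugging Face dataset ID ("lmsys/lmsys-chat-1m", used here only because of the 1M figure above), the "language" and "conversation" field names, and the classifier input format are all assumptions to check against the dataset and model cards.

```python
# Hedged sketch of the filtering pipeline: keep English data, cap at 100,000
# interactions for compute reasons, and keep only refusals with a next user turn.
from datasets import load_dataset
from transformers import pipeline

refusal_clf = pipeline("text-classification", model="NousResearch/Minos-v1")

def is_refusal(user_msg: str, assistant_msg: str) -> bool:
    # Same assumed input format and label name as the classifier sketch above.
    text = f"<|user|>\n{user_msg}\n<|assistant|>\n{assistant_msg}"
    pred = refusal_clf(text, truncation=True)[0]
    return pred["label"].lower().startswith("refusal")

ds = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)
english = ds.filter(lambda ex: ex["language"] == "English")  # drop non-English data
subset = english.take(100_000)                               # compute-constrained cap

kept = []
for ex in subset:
    turns = ex["conversation"]  # assumed: list of {"role", "content"} dicts
    for i in range(1, len(turns) - 1):
        if turns[i]["role"] != "assistant":
            continue
        user_msg, assistant_msg = turns[i - 1]["content"], turns[i]["content"]
        # Keep only refusals that are followed by another user turn.
        if turns[i + 1]["role"] == "user" and is_refusal(user_msg, assistant_msg):
            kept.append({"prompt": user_msg, "refusal": assistant_msg,
                         "follow_up": turns[i + 1]["content"]})
```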
The thematic analysis of the refusal messages proceeded through the following steps:
- Data filtering
  - Started from the original dataset of 1M user interactions
  - Removed non-English data
  - Reduced the data to 100,000 interactions because of compute constraints
  - Removed all non-refusal data
  - Removed all interactions that did not have a next user turn
  - Resulted in 14,159 responses
- Extracting codes for the refusals
  - Coded a random subset of 1,000 refusals because of compute limitations
  - Resulted in xxx codes
- Understanding the relationship between refusal style and follow-up behavior (a contingency-table sketch follows this outline)
  - What kinds of refusals led to non-follow-up?
  - What kinds of refusals led to follow-up?
- Limitations
  - The follow-up vs. non-follow-up distinction is more coarse-grained than the analysis I initially planned.
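Once the codes exist, one minimal way to examine the refusal-to-follow-up relationship is a contingency table of refusal style against follow-up, with an accompanying chi-square test. The `coded` records below are hypothetical placeholders for illustration, not results.

```python
# Hedged sketch: cross-tabulate refusal style against follow-up behavior.
# `coded` is a hypothetical structure; the real records come from the coding step.
import pandas as pd
from scipy.stats import chi2_contingency

coded = [
    {"style": "polite", "followed_up": True},
    {"style": "blunt", "followed_up": False},
    # ... one record per coded refusal
]

df = pd.DataFrame(coded)
table = pd.crosstab(df["style"], df["followed_up"])
chi2, p, dof, _ = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
```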