Data is used in all three stages of LLM development: unsupervised learning on a text corpus, supervised learning on question-and-answer pairs, and reinforcement learning from human/AI labels.

Some notable datasets are:

  • ShareGPT
    • Data from a Chrome extension that let users share their ChatGPT conversations
    • Legal grey area, since most conversations are unlicensed and collected without consent
    • Extensively used over the last 18 months, but starting to be replaced by carefully collected counterparts:
  • LMSYS-Chat-1M (https://lmarena.ai)
    • Cleaned conversations from ChatBotArena
    • Provides both a crowdsourced ranking of LLMs, as well as data for training instruction models
  • WildChat (https://wildchat.allen.ai)
    • 1M real-world ChatGPT interactions
    • Constructed by offering free access to ChatGPT and GPT-4 in exchange for users consenting to chat-history collection
  • Alpaca
  • Anthropic HH
  • WebGPT
  • Dolly
  • OpenAssistant
    • Human-generated, human-annotated, assistant-style conversation corpus
  • StableVicuna
    • Trained with proximal policy optimization (PPO) on popular datasets
      • OASST1 dataset for SFT + PPO
      • Anthropic HH + Stanford Human Preferences (SHP) for RL
  • UltraFeedback
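
Many of these live on the Hugging Face Hub; a quick way to inspect one (the hub ID and field name below reflect the dataset card as I recall it, and gated datasets require accepting their terms first):

```python
from datasets import load_dataset

# LMSYS-Chat-1M is gated on the Hub; WildChat is distributed similarly.
ds = load_dataset("lmsys/lmsys-chat-1m", split="train")
# Each row holds a multi-turn conversation as a list of role/content dicts.
print(ds[0]["conversation"])
```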

What is synthetic data?

Evaluation

  • ChatbotArena
    • Side-by-side preference collection between two different models
    • Pros:
      • At-scale, blind LLM community comparisons
      • Ranks top closed and open models
    • Cons:
      • No control over, or knowledge of, the prompt and user distribution
      • Hard tool to base engineering decisions on
      • Only the best models get in
  • AlpacaEval
    • LLM-as-a-judge Evaluation:
      • Evaluate a model by measuring the fraction of times a powerful LLM (GPT-4) prefers its outputs over outputs from a reference model (see the sketch after this list)
        • Basically ChatbotArena, but the evaluator is the powerful LLM
    • Has a 0.98 Spearman correlation with human evaluation
  • MT Bench
    • Similar to AlpacaEval, except:
      • Two turns (response and follow-up)
      • 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities)
      • Rate one model at a time from 1-10 to mitigate positional bias
  • Open LLM Leaderboard
    • Evaluates almost any model on the Hugging Face Hub; good for discovering models
  • Both of the LLM-as-a-judge evals (AlpacaEval and MT-Bench) suffer from scores saturating at the top end, and from self-bias: the judge model rates its own outputs highly.
  • Evaluations of alignment, in general, are difficult to translate into a training signal and are not very useful for studying alignment
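
For reference, a minimal sketch of an AlpacaEval-style win-rate computation with GPT-4 as the judge. The judge prompt, answer parsing, and the `win_rate` helper are simplified stand-ins, not the actual AlpacaEval templates:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Which response better answers the instruction? Reply with "A" or "B".

Instruction: {instruction}

Response A: {a}

Response B: {b}"""

def win_rate(pairs, judge_model="gpt-4"):
    """pairs: list of (instruction, model_output, reference_output).

    Returns the fraction of comparisons where the judge prefers the
    model's output over the reference output.
    """
    wins = 0
    for instruction, model_out, ref_out in pairs:
        prompt = JUDGE_PROMPT.format(instruction=instruction, a=model_out, b=ref_out)
        reply = client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        verdict = reply.choices[0].message.content.strip()
        wins += verdict.startswith("A")
    return wins / len(pairs)
```

A real benchmark also randomizes or debiases the A/B ordering, since LLM judges exhibit positional bias; MT-Bench sidesteps the same bias by grading one answer at a time.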

RLHF already has a built-in way of constraining the model not to trust the human feedback too much!
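
Concretely, the standard RLHF objective includes a KL penalty that keeps the policy close to the reference (SFT) model, so the policy cannot over-optimize the learned reward. One common formulation, where r_φ is the learned reward model and β sets the constraint strength:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```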

Using gradient ascent on this objective directly, instead of training an explicit reward model, works (DPO; see the sketch below)

  • Extremely simple to implement
  • Scales nicely with existing distributed training libraries
  • Trains an implicit reward function
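
A minimal PyTorch sketch of the resulting DPO loss (Rafailov et al., 2023), assuming the per-sequence log-probabilities of each chosen/rejected response under the policy and the frozen reference model have already been computed (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """DPO loss over a batch of (chosen, rejected) preference pairs.

    Each tensor holds the summed token log-probs of a full response;
    beta plays the same role as the KL coefficient in the RLHF objective.
    """
    # Implicit per-response rewards: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification on preferences: push the chosen response's
    # implicit reward above the rejected one's.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```

The chosen/rejected reward terms, β·log(π_θ/π_ref), are exactly the implicit reward function the last bullet refers to.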

Zephyr-β: made a splash with DPO; based on Mistral 7B
Tulu 2: scaled DPO to 70B

Genstruct: model for rephrasing any text into instructions
OLMo-Instruct: truly open-source models

Open problems and directions:

  • we have severely limited preference datasets (Anthropic HH, UltraFeedback (AI-generated), and Nectar)
  • continuing to improve DPO
  • how do we get more specific evals than ChatBotArena?
  • personalization: a large motivation behind local models
    • would love to use LocalLLaMA as a case study for what kind of personalization people want to do
      • have a lot of drive to create a language model that they like to use
      • What are these drives, and how have they done it?

Alignment is really a UX problem…

Pitfalls of synthetic data: repetitiveness, and distributions that are not robust (overly narrow)