• In-distribution: inputs that follow the same pattern as the data the model saw during training or the data used to build the simplification
  • Out-of-distribution: inputs that differ in ways the training/simplification data didn’t cover (new structures, deeper nesting, new languages, unseen patterns)
  • “Simplification” (a minimal code sketch follows this list):
    • Take the key and query vectors from a trained attention head for ~1000 sequences from the training data
    • Stack these key/query vectors into a matrix, one row per token
    • Compute the top principal components of this matrix via PCA (i.e. the top right singular vectors of the mean-centered matrix)
    • For every future input, replace each original key/query vector with its projection onto those top components
    • Use the projected keys/queries to compute attention weights instead of the originals
    • The point of the simplification is to reveal what an attention head or MLP is “doing”: e.g. “this head tracks bracket depth,” “this head copies previous tokens,” “this circuit implements addition.”
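
Here is a minimal NumPy sketch of the procedure above, assuming key/query activations have already been collected from one attention head. The sizes (`d_head`, token count, the rank kept) are illustrative assumptions, and random data stands in for the activations a trained model would actually produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: ~1000 sequences' worth of tokens from a head with
# d_head = 64, keeping the top rank = 8 components.
n_tokens, d_head, rank = 1000 * 32, 64, 8
keys    = rng.normal(size=(n_tokens, d_head))   # stand-in for real key vectors
queries = rng.normal(size=(n_tokens, d_head))   # stand-in for real query vectors

def top_components(x, k):
    """Top-k principal directions of x: the top right singular vectors
    of the mean-centered matrix (i.e. PCA on x)."""
    _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
    return vt[:k]                                # shape (k, d_head)

key_dirs   = top_components(keys, rank)
query_dirs = top_components(queries, rank)

def project(x, dirs):
    """Replace each row of x with its projection onto span(dirs)."""
    return (x @ dirs.T) @ dirs

# On a future input: project its keys/queries, then compute attention
# weights from the projections instead of the originals.
seq_len = 16
q = project(rng.normal(size=(seq_len, d_head)), query_dirs)
k = project(rng.normal(size=(seq_len, d_head)), key_dirs)

scores = q @ k.T / np.sqrt(d_head)                   # scaled dot products
causal = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
scores[causal] = -np.inf                             # standard causal mask
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)            # row-wise softmax
```

Comparing the attention weights computed this way against the originals, on both in-distribution and out-of-distribution inputs, is then how one checks whether the low-rank simplification actually captures what the head is doing.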