This post summarises what I think are interesting findings from Anthropic’s “A Mathematical Framework for Transformer Circuits” by Elhage et al. (2021). The paper presents early research in mechanistic interpretability to reverse-engineer transformer models into interpretable mathematical circuits.

![Attention-only transformer](attention-only.png)

To make progress on understanding transformers, the researchers focused on simplified "attention-only" models without MLP layers. This drastic simplification removes roughly two-thirds of the parameters, but preserves the core attention mechanisms that enable sophisticated behaviours.

Note: the equations below use matrix notation rather than the paper's tensor notation, but they represent the same computational flow in a simpler form.


![The residual stream](residual-stream.png)


The residual stream acts as a shared communication channel. It is the pathway through which information flows between layers: each layer reads from and writes to particular subspaces of the stream, and whatever is written persists until a later layer actively modifies it.
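As a rough illustration of this view, here is a minimal NumPy sketch (the function and variable names are my own, not from the paper): every layer computes an update from the current stream and adds it back in, so anything it does not touch flows straight through to later layers.

```python
import numpy as np

def run_residual_stream(x, layers):
    """Toy sketch of the residual-stream view.

    x      : (seq_len, d_model) residual stream, initialised from token embeddings
    layers : callables that read the current stream and return an update
             of the same shape
    """
    for layer in layers:
        # Each layer writes into the stream by addition; directions it does
        # not write to reach later layers unchanged.
        x = x + layer(x)
    return x
```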


![Zero-layer transformer](zero-layer.png)

$$ \mathbf{X}_{\text{logits}} = \mathbf{X}_{\text{tokens}} \mathbf{W}_E \mathbf{W}_U $$


Zero-layer transformers only implement bigram statistics. The simplest possible transformer maps tokens directly to predictions: the $\mathbf{W}_E \mathbf{W}_U$ matrix learns bigram statistics during training, capturing [a] → [b] patterns like hello → world. This is pure statistical pattern matching based on training frequency, with no algorithmic reasoning involved.
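To make the bigram claim concrete, here is a small NumPy sketch with toy, randomly initialised weights (the sizes and names are illustrative): because the only learnable computation is the product $\mathbf{W}_E \mathbf{W}_U$, each row of that matrix is just a table of what tends to follow a given token.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 1000, 64                     # toy sizes, not from the paper

W_E = rng.normal(size=(vocab, d_model))       # embedding
W_U = rng.normal(size=(d_model, vocab))       # unembedding

def zero_layer_logits(token_ids):
    """Logits of a zero-layer transformer: a single table lookup per token.

    Row t of (W_E @ W_U) holds the bigram logits for "what follows token t",
    so the model can express nothing beyond token -> next-token statistics.
    """
    bigram_matrix = W_E @ W_U                 # (vocab, vocab)
    return bigram_matrix[token_ids]           # (seq_len, vocab)

print(zero_layer_logits(np.array([42, 7, 13])).shape)  # (3, 1000)
```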


![One-layer attention-only transformer](one-layer.png)

$$ \mathbf{X}_{\text{out}} = \mathbf{X}_{\text{in}} + \sum_{i=1}^{H} \text{softmax} \Big( \mathbf{X}_{\text{in}} \mathbf{W}_{Q}^{(i)} (\mathbf{W}_{K}^{(i)})^{T} \mathbf{X}_{\text{in}}^{T} \Big) \mathbf{X}_{\text{in}} \Big( \mathbf{W}_{V}^{(i)} \mathbf{W}_{O}^{(i)} \Big) $$
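The equation can be translated almost line for line into NumPy. The sketch below is illustrative only: the per-head weight names are my own, and it omits the causal mask and the $1/\sqrt{d_{\text{head}}}$ scaling that real transformers apply inside the softmax, to stay faithful to the simplified equation.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def one_layer_attention_only(X, heads):
    """Literal translation of the one-layer equation above.

    X     : (seq_len, d_model) residual stream
    heads : list of (W_Q, W_K, W_V, W_O) tuples with shapes
            (d_model, d_head), (d_model, d_head), (d_model, d_head), (d_head, d_model)
    """
    out = X.copy()
    for W_Q, W_K, W_V, W_O in heads:
        A = softmax(X @ W_Q @ W_K.T @ X.T)    # (seq, seq) attention pattern: the QK circuit
        out = out + A @ X @ (W_V @ W_O)       # each head writes back through its OV circuit
    return out
```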


One-layer transformers implement skip-trigrams and primitive copying. Adding one attention layer enables skip-trigrams, where the relevant tokens can be separated by arbitrary distance: [a] ... [b] → [a'] patterns like dog ... young → puppy. Notably, one-layer transformers dedicate enormous capacity to copying, a primitive form of in-context learning: [a] ... [b] → [a] patterns like pizza ... eat → pizza. Copying behaviour shows up as positive eigenvalues of the OV circuit ($\mathbf{M}_{OV}^{(i)} \mathbf{v}_k = \lambda_k \mathbf{v}_k$ with $\lambda_k > 0$), meaning that a set of tokens ($\mathbf{v}_k$) tends to amplify its own logits ($\lambda_k \mathbf{v}_k$) when attended to.
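A minimal sketch of that eigenvalue check, assuming access to a head's weight matrices and taking $\mathbf{M}_{OV}^{(i)} = \mathbf{W}_E \mathbf{W}_V^{(i)} \mathbf{W}_O^{(i)} \mathbf{W}_U$ as the full token-to-logits OV circuit (the function name and the exact summary statistic below are mine; the paper uses a similar eigenvalue-based copying score): the ratio approaches 1 when nearly all eigenvalues are positive, i.e. when the head mostly copies.

```python
import numpy as np

def ov_copying_score(W_E, W_V, W_O, W_U):
    """Eigenvalue summary for one head's full OV circuit.

    M_OV maps "token that is attended to" -> "contribution to the output logits".
    Positive eigenvalues mean attended tokens boost their own logits (copying).
    The ratio returned is close to 1 when almost all eigenvalues are positive.
    """
    M_OV = W_E @ (W_V @ W_O) @ W_U            # (vocab, vocab) full OV circuit
    eigvals = np.linalg.eigvals(M_OV)         # complex in general
    return eigvals.real.sum() / np.abs(eigvals).sum()
```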