Transformers Explained
The architecture behind every modern LLM.
Introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al., the Transformer replaced recurrence (RNN/LSTM) and convolution with a single primitive: self-attention. By processing every token in a sequence in parallel, it unlocked the GPU-era scaling that produced GPT, Claude, Gemini, Llama, Mistral, DeepSeek, and Qwen.
Why it replaced RNNs
RNNs process tokens one at a time, so training cannot parallelize across the sequence and long-range gradients vanish. Transformers compute relationships between all pairs of tokens in one matrix multiplication, making them GPU/TPU friendly and able to model dependencies across thousands — now millions — of tokens.
The core primitive: scaled dot-product attention
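Each token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). Attention asks: 'for this query, which keys are relevant, and what values should I pull from them?' Mathematically: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. The √d_k divisor keeps the softmax out of saturation as dimensions grow: the dot product of two d_k-dimensional vectors with roughly unit-variance components has a variance of about d_k, so unscaled scores would grow with dimension and leave the softmax with vanishingly small gradients.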
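The formula can be traced by hand on a toy input. The following is a minimal sketch in plain Python (no tensor library), assuming Q, K, and V are already projected and stored as lists of row vectors; the function names and the toy numbers are illustrative, not taken from any particular implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: 3 tokens, d_k = d_v = 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
```

Because the softmax weights are non-negative and sum to 1, every output row is a convex combination of the rows of V, which is a handy sanity check when debugging.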
Multi-head attention
Instead of one attention operation, the model runs h smaller attention 'heads' in parallel, each with its own Q/K/V projections, typically of width d_model / h. Different heads tend to specialize in different relationships (syntactic agreement, coreference, positional patterns, factual recall). Their outputs are concatenated and linearly projected back to the model dimension.
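The split, attend, concatenate, and project steps might look like the sketch below. It assumes random untrained weights, a head width of d_model / h, and no masking; all names (`multi_head_attention`, `random_matrix`, `W_o`) are hypothetical helpers chosen for this illustration.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs along the feature axis, token by token.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    # Final linear projection back to the model dimension.
    return matmul(concat, W_o)

rng = random.Random(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(d_model, d_model, rng)
Y = multi_head_attention(X, heads, W_o)  # shape: seq_len x d_model
```

Real implementations usually fuse the per-head projections into one large matrix multiply and reshape; the per-head loop here is only for readability.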
Anatomy of one Transformer block
A decoder-only model, the design used by most modern generative LLMs, stacks the following block N times (e.g. 32 layers in Llama 3 8B, 80 or more in the largest models):
- LayerNorm / RMSNorm (pre-norm in modern LLMs)
- Causal multi-head self-attention with a triangular mask so a token can attend only to itself and earlier tokens
- Residual connection (add input back)
- LayerNorm / RMSNorm
- Feed-forward network (FFN/MLP): classically 4× the hidden size; now usually a SwiGLU gated variant, which often uses a somewhat narrower inner width because the gate adds a third weight matrix
- Residual connection
Residuals are what let signals (and gradients) flow through 80+ layers without vanishing.
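The ordering of norm, sub-layer, and residual add can be sketched as follows. To stay short, this sketch makes several simplifying assumptions that a real block does not: a single attention head with identity Q/K/V projections, RMSNorm without a learned gain, and a stand-in ReLU feed-forward function instead of SwiGLU. The causal mask is implemented by slicing, which has the same effect as setting masked scores to minus infinity.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (assumption: gain fixed at 1)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def causal_self_attention(X):
    """Single head, identity Q/K/V projections (illustrative only)."""
    d = len(X[0])
    out = []
    for t, q in enumerate(X):
        # Triangular mask: token t sees only positions 0..t.
        visible = X[: t + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in visible]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[j] for e, v in zip(exps, visible))
                    for j in range(d)])
    return out

def ffn(x):
    """Stand-in position-wise FFN: scaled ReLU, not a real SwiGLU MLP."""
    return [0.5 * max(0.0, v) for v in x]

def block(X):
    # Sub-layer 1: norm -> causal attention -> residual add.
    attn_out = causal_self_attention([rms_norm(x) for x in X])
    X = [[a + b for a, b in zip(x, o)] for x, o in zip(X, attn_out)]
    # Sub-layer 2: norm -> feed-forward -> residual add.
    ffn_out = [ffn(rms_norm(x)) for x in X]
    return [[a + b for a, b in zip(x, o)] for x, o in zip(X, ffn_out)]

X = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0], [-1.0, 1.0, 1.0]]
Y = block(X)

# Causality check: changing the last token must not change earlier outputs.
X2 = [X[0], X[1], [9.0, -9.0, 9.0]]
Y2 = block(X2)
```

In this sketch `Y[0]` and `Y[1]` are identical to `Y2[0]` and `Y2[1]`, because the mask prevents the first two positions from seeing the modified third token.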
Positional information
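Self-attention on its own is permutation-equivariant: shuffle the input tokens and the outputs are shuffled in exactly the same way, so the layer has no notion of which token came first. Position must be injected explicitly: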
- Sinusoidal encodings (original paper) — fixed sin/cos signals added to embeddings.
- Learned absolute positions (GPT-2/3) — a trainable vector per position.
- RoPE (Rotary Position Embedding) — rotates Q and K in 2D subspaces by an angle proportional to position. Used by Llama, Mistral, Qwen, DeepSeek, and most modern open models. It generalizes better to longer contexts and is the de facto standard.
- ALiBi — biases attention scores by distance; another length-extrapolation approach.
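To make the RoPE bullet concrete, the sketch below rotates consecutive pairs of dimensions and checks that the query-key score depends only on the relative offset between positions. It assumes the pairing (x[0], x[1]), (x[2], x[3]), ... and a frequency base of 10000 as in the RoFormer paper; some model implementations pair dimensions differently, and the vectors here are arbitrary.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by pos * theta_i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)       # lower frequency for later pairs
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        # Standard 2D rotation of the pair (x[i], x[i + 1]).
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3 positions apart) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

Here `score_a` and `score_b` agree up to floating-point error, and rotation leaves vector norms unchanged, which is why RoPE is applied to Q and K rather than added to the embeddings.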
Three architectural families
- Encoder-only (BERT, RoBERTa) — bidirectional attention, great for classification and embeddings.
- Encoder–decoder (T5, original Transformer) — encoder reads input, decoder generates output with cross-attention. Used for translation and seq2seq tasks.
- Decoder-only (GPT, Llama, Claude, Gemini, Mistral, DeepSeek, Qwen) — causal self-attention, trained to predict the next token. This is the architecture behind every modern frontier LLM, because next-token prediction at scale turns out to be a remarkably general objective.
How a token becomes a prediction
- Tokenization — text is split into subword tokens (BPE, SentencePiece, tiktoken).
- Embedding — each token id is looked up in an embedding matrix to produce a vector.
- N Transformer blocks — each block mixes information across tokens (attention) and across features (FFN).
- Final LayerNorm and an unembedding / LM head (often weight-tied to the input embedding) produce a logit for every token in the vocabulary.
- Softmax → sampling — temperature, top-k, top-p, or greedy decoding selects the next token, which is appended and fed back in. This autoregressive loop is what 'generation' actually is.
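The last step of the pipeline above can be sketched as a small sampler. This is an illustrative version, assuming logits arrive as a plain list of floats and that top-k is applied before top-p; production libraries differ in ordering and edge-case handling, and the function name and defaults are made up for this example.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return a token id chosen from logits using temperature, top-k, and top-p."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Candidate ids sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    # random.choices normalizes the weights of the surviving candidates.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy 4-token vocabulary
rng = random.Random(0)
greedy = sample_next_token(logits, temperature=0.0)
samples = [sample_next_token(logits, temperature=0.8, top_k=2, rng=rng)
           for _ in range(20)]
```

With `top_k=2`, only token ids 0 and 1 can ever be drawn from these logits. In a full generation loop, the returned id would be appended to the sequence and the model run again to produce the next set of logits.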
Modern upgrades you'll see in every recent LLM
- RMSNorm instead of LayerNorm — fewer parameters, same stability.
- SwiGLU activation in the FFN — beats ReLU/GeLU empirically.
- RoPE for positions — enables context-length extension.
- Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) — multiple query heads share fewer key/value heads, shrinking the KV cache and dramatically speeding up inference.
- FlashAttention — an IO-aware exact attention kernel that fuses operations to avoid materializing the full N×N attention matrix in HBM. Makes long contexts practical.
- Mixture of Experts (MoE): replaces the dense FFN with many expert FFNs and a router that activates only a few per token. Mixtral, DeepSeek V3, and (reportedly) GPT-4 use MoE to grow total parameters without a matching increase in per-token compute, although all experts must still be held in memory.
- KV cache — at inference, previously computed keys and values are cached so each new token only does O(N) attention work instead of O(N²).
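The effect of GQA and MQA on the KV cache comes down to simple arithmetic. The sketch below estimates cache size per sequence; the configuration (32 layers, 32 query heads of width 128, an 8192-token context, 16-bit cache entries) is an assumed illustrative setup, not a measurement of any specific deployment.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative configuration (assumed for this example).
layers, q_heads, head_dim, seq_len = 32, 32, 128, 8192

mha = kv_cache_bytes(layers, q_heads, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)        # 8 KV heads shared by 32 query heads
mqa = kv_cache_bytes(layers, 1, head_dim, seq_len)        # a single shared KV head

GiB = 2 ** 30
print(mha / GiB, gqa / GiB, mqa / GiB)  # 4.0 1.0 0.125
```

Under these assumptions, sharing 8 KV heads across 32 query heads cuts the cache from 4 GiB to 1 GiB per sequence, which translates directly into larger batches or longer contexts on the same hardware.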
Scaling laws and why this architecture wins
Kaplan et al. (2020) and Hoffmann et al. (2022, the Chinchilla paper) showed that loss falls predictably as a power law in parameters, data, and compute. The Transformer has proved flexible and parallelizable enough to ride that curve for many orders of magnitude, which is why essentially every frontier model in 2024–2026, open or closed, is a decoder-only Transformer with the upgrades above.
Further reading
- Vaswani et al., 'Attention Is All You Need' (2017)
- Jay Alammar, 'The Illustrated Transformer'
- Andrej Karpathy, 'Let's build GPT' (YouTube) and nanoGPT
- 'FlashAttention' (Dao et al., 2022) and 'FlashAttention-2' (2023)
- 'RoFormer: Enhanced Transformer with Rotary Position Embedding' (Su et al., 2021)