Transformers Explained

Introduced in the 2017 paper 'Attention Is All You Need' by Vaswani et al., the Transformer replaced recurrence (RNN/LSTM) and convolution with a single primitive: self-attention. By processing every token in a sequence in parallel, it unlocked the GPU-era scaling that produced GPT, Claude, Gemini, Llama, Mistral, DeepSeek, and Qwen.

Why it replaced RNNs

RNNs process tokens one at a time, so training cannot parallelize across the sequence and long-range gradients vanish. Transformers compute relationships between all pairs of tokens in one matrix multiplication, making them GPU/TPU friendly and able to model dependencies across thousands — now millions — of tokens.

The core primitive: scaled dot-product attention

Each token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). Attention asks: 'for this query, which keys are relevant, and what values should I pull from them?' Mathematically: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the dimension of the key vectors. Without the √d_k divisor, dot products grow in magnitude with d_k and push the softmax into saturated regions with tiny gradients; dividing by √d_k keeps the scores in a stable range.
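
As a rough illustration of the formula, here is a minimal NumPy sketch of single-head attention. The function names and the toy shapes are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).

    Returns the attended values (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # all pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

# Toy usage: 4 tokens, d_k = 8, d_v = 16 (arbitrary illustrative sizes).
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Each row of the weight matrix sums to 1, so every output row is a weighted average of the value vectors.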

Multi-head attention

Instead of one attention operation, the model runs h smaller attention 'heads' in parallel, each with its own Q/K/V projections and, typically, a per-head dimension of d_model / h. Different heads tend to specialize in different relationships (syntactic agreement, coreference, positional patterns, factual recall). Their outputs are concatenated and linearly projected back to the model dimension.
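
The split, attend, concatenate, project sequence can be sketched as follows. This is a simplified NumPy version that assumes d_model is divisible by the number of heads and stores each projection as one (d_model, d_model) matrix whose column blocks belong to the individual heads; the names W_q, W_k, W_v, and W_o are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (n, d_model) -> (n_heads, n, d_head); head i gets column block i
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(x @ W_q)
    K = split_heads(x @ W_k)
    V = split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    heads = softmax(scores) @ V                             # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads
    return concat @ W_o                                     # project back to d_model
```

Batching the heads into one tensor is only a convenience; the result is the same as running each head separately on its own slice of the projection matrices.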

Anatomy of one Transformer block

A decoder-only model, the design used by nearly all current LLMs, stacks the same block N times (e.g. 32 layers in Llama 3 8B, 80 in Llama 3 70B). Each block applies these steps in order:

LayerNorm / RMSNorm (pre-norm in modern LLMs)
Causal multi-head self-attention with a triangular mask so a token can only attend to itself and earlier tokens
Residual connection (add input back)
LayerNorm / RMSNorm
Feed-forward network (FFN/MLP) — classically expands to 4× the hidden size; now usually a SwiGLU gated variant, often with a somewhat smaller expansion to offset its third weight matrix
Residual connection

Residuals are what let signals (and gradients) flow through 80+ layers without vanishing.
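
Putting the six steps together, one pre-norm decoder block might look like the sketch below. It assumes RMSNorm, a SwiGLU feed-forward network, and bias-free projections; the parameter dictionary, its key names, and init_params are hypothetical conveniences for this example rather than any framework's layout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def rms_norm(x, gain, eps=1e-6):
    # Rescale each token vector by its root-mean-square, then apply a learned gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def silu(x):
    return x / (1.0 + np.exp(-x))

def causal_self_attention(x, p, n_heads):
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(x @ p[name]) for name in ("W_q", "W_k", "W_v"))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide later positions
    out = softmax(scores) @ V
    return out.transpose(1, 0, 2).reshape(n, d_model) @ p["W_o"]

def swiglu_ffn(x, p):
    # Gated FFN: a SiLU-activated gate multiplies a linear "up" projection.
    return (silu(x @ p["W_gate"]) * (x @ p["W_up"])) @ p["W_down"]

def decoder_block(x, p, n_heads):
    x = x + causal_self_attention(rms_norm(x, p["g_attn"]), p, n_heads)  # norm, attend, residual
    x = x + swiglu_ffn(rms_norm(x, p["g_ffn"]), p)                       # norm, FFN, residual
    return x

def init_params(d_model, d_ff, rng):
    """Random toy weights, scaled so activations stay in a reasonable range."""
    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)
    p = {name: w(d_model, d_model) for name in ("W_q", "W_k", "W_v", "W_o")}
    p.update(W_gate=w(d_model, d_ff), W_up=w(d_model, d_ff), W_down=w(d_ff, d_model))
    p.update(g_attn=np.ones(d_model), g_ffn=np.ones(d_model))
    return p
```

Because each sub-layer only adds to x, the identity path from input to output is never interrupted, which is the property the residual connections provide.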

Positional information

Self-attention on its own is permutation-equivariant: shuffle the input tokens and the outputs are simply shuffled the same way, so the layer has no notion of which token came first. Position must be injected explicitly:

Sinusoidal encodings (original paper) — fixed sin/cos signals added to embeddings.
Learned absolute positions (GPT-2/3) — a trainable vector per position.
RoPE (Rotary Position Embedding) — rotates Q and K in 2D subspaces by an angle proportional to position, so attention scores depend only on the relative offset between tokens. Used by Llama, Mistral, Qwen, DeepSeek, and most modern open models. Combined with frequency-scaling tricks it extends to longer contexts, and it is the de facto standard.
ALiBi — biases attention scores by distance; another length-extrapolation approach.
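
To make the rotary idea concrete, here is a small NumPy sketch of RoPE. It assumes the interleaved pairing of dimensions (0, 1), (2, 3), and so on, with a base of 10000; some implementations pair the first half of the vector with the second half instead, which is an equivalent rotation under a different dimension ordering.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each consecutive pair of features by a position-dependent angle.

    x: (n, d) float array with d even; positions: (n,) array of token positions.
    """
    n, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # one frequency per 2D pair
    angles = positions[:, None] * inv_freq[None, :]    # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x_even * cos - x_odd * sin          # standard 2D rotation
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# The query-key score depends on the offset between positions, not their absolute values.
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))
near = rope(q, np.array([3])) @ rope(k, np.array([1])).T
far = rope(q, np.array([103])) @ rope(k, np.array([101])).T
print(np.allclose(near, far))
```

Both pairs of positions are two tokens apart, so the two scores agree up to floating-point error.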

Three architectural families

Encoder-only (BERT, RoBERTa) — bidirectional attention, great for classification and embeddings.
Encoder–decoder (T5, original Transformer) — encoder reads input, decoder generates output with cross-attention. Used for translation and seq2seq tasks.
Decoder-only (GPT, Llama, Claude, Gemini, Mistral, DeepSeek, Qwen) — causal self-attention, trained to predict the next token. This is the architecture behind every modern frontier LLM, because next-token prediction at scale turns out to be a remarkably general objective.
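
One way to see the difference between the three families is through which positions each token is allowed to attend to. The sketch below builds the corresponding boolean masks (True means "may attend"); the function names are illustrative only.

```python
import numpy as np

def bidirectional_mask(n):
    """Encoder-only: every token may attend to every other token."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n):
    """Decoder-only: token i may attend to positions 0..i and nothing later."""
    return np.tril(np.ones((n, n), dtype=bool))

def cross_attention_mask(n_target, n_source):
    """Encoder-decoder cross-attention: each target token sees the whole source."""
    return np.ones((n_target, n_source), dtype=bool)

print(causal_mask(4).astype(int))
```

In an encoder-decoder model the decoder uses both a causal mask over its own tokens and a full cross-attention mask over the encoder output.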

How a token becomes a prediction

Tokenization — text is split into subword tokens, typically with a BPE-style algorithm implemented in a library such as SentencePiece or tiktoken.
Embedding — each token id is looked up in an embedding matrix to produce a vector.
N Transformer blocks — each block mixes information across tokens (attention) and across features (FFN).
Final norm and LM head — a final LayerNorm/RMSNorm and an unembedding matrix (often weight-tied to the input embedding) produce a logit for every token in the vocabulary.
Softmax → sampling — temperature, top-k, top-p, or greedy decoding selects the next token, which is appended and fed back in. This autoregressive loop is what 'generation' actually is.
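
The last step of the pipeline, turning logits into a chosen token and looping, can be sketched as below. This is a simplified version: next_token_logits is a hypothetical stand-in for the whole model (token ids in, vocabulary logits out), and the convention that temperature 0 means greedy decoding is an assumption of this example.

```python
import numpy as np

def sample_next_token(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Pick one token id from a 1-D array of vocabulary logits."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))                 # greedy decoding
    logits = logits / temperature                     # <1 sharpens, >1 flattens
    if top_k is not None:
        kth = np.sort(logits)[-min(top_k, len(logits))]
        logits = np.where(logits < kth, -np.inf, logits)   # keep the k best logits
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    if top_p is not None:
        order = np.argsort(probs)[::-1]               # most likely first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1    # smallest prefix reaching top_p
        filtered = np.zeros_like(probs)
        filtered[order[:cutoff]] = probs[order[:cutoff]]
        probs = filtered / filtered.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(next_token_logits, prompt_ids, n_new, rng, **sampling):
    """Autoregressive loop: predict, sample, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = next_token_logits(ids)               # hypothetical model call
        ids.append(sample_next_token(logits, rng, **sampling))
    return ids
```

A real decoder would also stop on an end-of-sequence token and reuse cached keys and values rather than re-reading the whole sequence at every step.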

Modern upgrades you'll see in every recent LLM

RMSNorm instead of LayerNorm — drops the mean-centering and bias terms, so it is cheaper to compute with comparable training stability.
SwiGLU activation in the FFN — beats ReLU/GeLU empirically.
RoPE for positions — enables context-length extension.
Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) — multiple query heads share fewer key/value heads, shrinking the KV cache and dramatically speeding up inference.
FlashAttention — an IO-aware exact attention kernel that fuses operations to avoid materializing the full N×N attention matrix in HBM. Makes long contexts practical.
Mixture of Experts (MoE) — replaces the dense FFN with many expert FFNs and a router that activates only a few per token. Mixtral, DeepSeek V3, and (reportedly) GPT-4 use MoE to grow total parameters without a proportional increase in per-token inference compute.
KV cache — at inference, previously computed keys and values are cached, so generating each new token costs O(N) attention work over the N cached positions instead of recomputing the full O(N²) attention from scratch.
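
Two of these upgrades, grouped-query attention and the KV cache, combine naturally at decode time. The sketch below is a toy NumPy version under stated assumptions: the KVCache class and gqa_attend function are hypothetical names, and the cache grows by concatenation for clarity, whereas production code would preallocate its buffers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

class KVCache:
    """Stores keys and values for all tokens seen so far, per KV head."""
    def __init__(self, n_kv_heads, d_head):
        self.K = np.zeros((n_kv_heads, 0, d_head))
        self.V = np.zeros((n_kv_heads, 0, d_head))

    def append(self, k, v):
        """k, v: (n_kv_heads, d_head) for the newest token."""
        self.K = np.concatenate([self.K, k[:, None, :]], axis=1)
        self.V = np.concatenate([self.V, v[:, None, :]], axis=1)

def gqa_attend(q, K, V, n_kv_heads):
    """Attention for one new token.

    q: (n_q_heads, d_head); K, V: (n_kv_heads, t, d_head) from the cache.
    Query head h reads KV head h // (n_q_heads // n_kv_heads).
    """
    n_q_heads, d_head = q.shape
    group = n_q_heads // n_kv_heads
    K_full = np.repeat(K, group, axis=0)                 # share each KV head across its group
    V_full = np.repeat(V, group, axis=0)
    scores = np.einsum("hd,htd->ht", q, K_full) / np.sqrt(d_head)
    return np.einsum("ht,htd->hd", softmax(scores), V_full)

def decode_step(cache, q, k, v, n_kv_heads):
    """Append the new token's k and v, then attend over everything cached."""
    cache.append(k, v)
    return gqa_attend(q, cache.K, cache.V, n_kv_heads)
```

No causal mask is needed here: the cache only ever contains the current and earlier tokens, so each step does work linear in the number of cached positions. With n_kv_heads equal to 1 this reduces to MQA, and with n_kv_heads equal to the number of query heads it is ordinary multi-head attention.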

Scaling laws and why this architecture wins

Kaplan et al. (2020) and Hoffmann et al. (2022, the Chinchilla paper) showed that loss falls predictably as a power law in parameters, data, and compute. The Transformer has proved flexible and parallelizable enough to ride that curve for many orders of magnitude, which is why essentially every frontier model in 2024–2026, open or closed, is a decoder-only Transformer with the upgrades above.
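
The shape of such a power law can be written down in a few lines. The sketch below uses the parametric form L(N, D) = E + A / N^alpha + B / D^beta from the Chinchilla analysis, together with the common rule of thumb of about 6 FLOPs per parameter per training token. The constants in the usage example are placeholders chosen for illustration, not fitted values from either paper.

```python
def parametric_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """L(N, D) = E + A / N**alpha + B / D**beta.

    E is the irreducible loss; the other two terms shrink as the model (N)
    and the training set (D) grow.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

# Placeholder constants for illustration only (NOT fitted values).
consts = dict(E=1.0, A=100.0, B=100.0, alpha=0.3, beta=0.3)
for n_params in (1e8, 1e9, 1e10):
    n_tokens = 20 * n_params   # assumes roughly 20 tokens per parameter
    print(n_params, parametric_loss(n_params, n_tokens, **consts),
          approx_training_flops(n_params, n_tokens))
```

With any positive constants the predicted loss falls smoothly toward E as parameters and data grow, which is the qualitative behavior the scaling-law papers describe.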