Bonus · Chapter 16

LLMs and All Their Fun Magic

From Attention Is All You Need to models that actually think. Tokens, embeddings, QKV, transformers, RLHF, sampling, KV caches, MoE, chain-of-thought, o1-style reasoning — the whole stack, from scratch.

Alright. You've made it through fifteen chapters of slowly-but-surely learning the actual machinery of machine learning. You can do linear regression by hand. You've argued with a decision tree. You've made a tiny neural net conquer XOR like a toddler defeating a Rubik's Cube.

Now we're going to build the thing that broke the internet.

Large Language Models. ChatGPT. Claude. Gemini. Llama. The reason your grandma now knows what an "AI" is. The reason every startup landing page these days has the word "agent" on it.

By the end of this chapter you'll know exactly what happens between the moment you hit Enter and the moment a model thinks for forty seconds and hands you back a better answer than a junior engineer. No magic. Just a giant tower of multiplications and a handful of disgustingly clever ideas, stacked taller than they have any right to be.

Fair warning: this is the longest chapter in the book. It's also the best one, and those two facts are not unrelated.

The Setup: What Even Is an LLM?

Strip away the marketing and an LLM is one sentence: a giant neural network trained to predict the next word in a sequence of words.

That's it. That's the trick. You feed it "The cat sat on the" and it outputs a probability distribution over its vocabulary saying things like:

mat → 38%
rug → 14%
couch → 9%
floor → 7%
… (the other ~99,996 tokens)

You sample one. You glue it onto the end. You feed the new string back in and ask for the next one. You do that 500 times. Suddenly, you've written a poem about regret.

That entire output — the essays, the code, the "you're absolutely right!" apologies — is just an autoregressive loop over "guess the next word", trained on basically the whole internet. The trick is in how the guessing works inside.
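To make that loop concrete, here is a minimal sketch. The `NEXT` table is a made-up stand-in for the model, and `generate` is a hypothetical helper name. A real LLM computes the distribution with a Transformer and conditions on the whole context, not just the last word as this toy does.

```python
import numpy as np

# Hypothetical toy "model": a hand-written table of next-word probabilities.
# It only looks at the last word; a real LLM looks at the whole context.
NEXT = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}

def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    rng = np.random.default_rng(seed)
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        # 1. ask the "model" for a distribution over the next word
        dist = NEXT.get(tokens[-1], {"<eos>": 1.0})
        words, probs = zip(*dist.items())
        # 2. pick one word (most likely, or sampled)
        if greedy:
            nxt = words[int(np.argmax(probs))]
        else:
            nxt = str(rng.choice(words, p=probs))
        if nxt == "<eos>":
            break
        # 3. glue it on and go around again
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the cat", max_new_tokens=4))
```

Swap `greedy=False` in and the same loop starts producing different continuations from the same prompt, which is all "sampling" means.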

The architecture that made it all possible is called the Transformer, introduced in a 2017 paper with the unforgettably arrogant title Attention Is All You Need. They were right.

The journey looks like this:

1. Tokenize: turn the input string into integer IDs.
2. Embed: turn each ID into a vector.
3. Add positions: so the model knows what came first.
4. Attention: every token looks at every other token.
5. MLP / FFN: mix it all up some more.
6. Stack steps 4 + 5 a few dozen times.
7. Unembed: project back to vocab-size logits.
8. Sample: pick a token. Glue. Repeat.

That builds the brain. Then we still have to teach it (pretrain), give it manners (post-train), make it run fast (inference), bolt on the modern upgrades, and finally make it actually think. Eleven steps. One story. Let's build it.

Step 1: Tokenization — Why Words Are a Lie

A neural net does math on numbers. Strings aren't numbers. So before anything else we need to chop the input string into pieces and assign each piece an integer ID. This is tokenization, and it's lowkey the most underrated part of the whole stack.

You have three options:

Character-level: vocab is ~256, but the sequences are huge and the model wastes capacity learning that "th" is common.
Word-level: short sequences, but the vocab explodes past a million once you count typos, names, URLs, and every Unicode script ever. Every unknown word becomes <UNK>. Death for code.
Subword (BPE): the sweet spot. Frequent words stay whole, rare words shatter into reusable pieces. Start from raw bytes and nothing is ever out-of-vocabulary (OOV).

Modern LLMs all use subword. The dominant algorithm is Byte-Pair Encoding (BPE), originally a 1994 compression trick repurposed for NLP in 2016. Here's the algorithm — it's shockingly simple:

Initialize vocab with the base units — raw bytes (0–255) for byte-level BPE, or plain characters for the classic flavor.
Pre-split the corpus into words.
Count all adjacent symbol pairs.
Merge the most frequent pair into a new symbol; add it to the vocab.
Repeat until vocab hits your target size (say 50,000).

Encoding new text just applies the merges in the same order they were learned. Greedy, deterministic, fast.

Why "how many R's in strawberry?" trips up GPT.

Two real-world flavors do roughly the same job — text in, vocab IDs out:

SentencePiece (Google, used by Llama, T5, Gemma) treats input as raw Unicode and is language-agnostic — spaces and all.
tiktoken (OpenAI) is a blazing-fast Rust byte-level BPE used by GPT-3.5/4 with cl100k_base; GPT-4o moved to a chunkier o200k_base.

Vocab sizes worth knowing in 2026:

Llama 2: 32,000
Llama 3: 128,256 (4× bigger — big jump in multilingual and code quality)
GPT-4 (cl100k_base): 100,277
GPT-4o (o200k_base): ~200,000
Gemma: 256,000

Bigger vocab = shorter sequences = cheaper inference, but a fatter embedding matrix and softmax. There's a tradeoff and modern models keep edging up.

Here's the spicy part nobody tells you: tokenizers are trained mostly on English, so English is cheap and every other language pays rent. The exact same sentence in Burmese or Telugu can shatter into 5–10× more tokens than in English — same meaning, same idea, way bigger bill. You literally get charged more for thinking in your own language, and it eats your context window faster too. Tokenization isn't just plumbing — it's a quiet, accidental tax on half the planet.

Practical kicker: this is also why APIs bill you per token, not per word. English prose runs about 0.75 words per token, but a wall of JSON, code, or weird Unicode tokenizes way worse — sometimes one token per character. If your bill looks suspiciously high, check what you're feeding it. Whitespace-heavy code is secretly expensive.

The Strawberry Bug

Ask GPT-4 "how many R's are in strawberry" and it might confidently say 2. People treat this as a reasoning failure. It isn't. It's a tokenization failure.

The word strawberry tokenizes into three opaque chunks like [str, aw, berry]. The model literally never sees individual letters — it sees three opaque vocab IDs. Asking it to count R's is like asking you to count the bones in a hot dog you only saw whole and pre-packaged.

Same reason LLMs are bad at:

Reversing strings.
Counting syllables.
Rhyming reliably.
Arithmetic on really long numbers (each digit can be a separate token, and adjacent digits get glued unpredictably).

It's a representation bug, not a reasoning bug. Knowing that distinction is the difference between sounding smart about LLMs and sounding like a guy who read one tweet.

Tokenization also gets genuinely cursed. In 2023 researchers found "glitch tokens" — strings like SolidGoldMagikarp (a Reddit username scraped into the vocab but barely seen during the model's actual training). Ask an old GPT to repeat one and it would lose its mind — insult you, dodge, hallucinate a completely different word. The model had a token for a thing it had basically never learned: a vocab entry with no real meaning behind it. Haunted houses in the token table.

Want to catch the strawberry bug live? Open ChatGPT and ask it to spell "strawberry" out one letter at a time — it usually nails that, because now each letter is its own token. Then ask it to count the R's in one shot and watch it fumble. Same model, same word, different tokenization. You just diagnosed a bug a billion users have blamed on "AI being dumb."

And the real lesson: never trust an LLM with exact string surgery — "remove every third comma," "count the vowels." It can't see characters. If a task needs character-level precision, make the model write code that does it, then run the code. The LLM is the manager, not the intern.

Here's a toy interactive — type anything, watch your sentence shatter:

[Interactive demo: type text, watch it shatter into tokens. Sample shown: Transformers·tokenize·unbelievable·preprocessing-heavy·sentences·into·subwords. Readouts: chars, tokens, chars / token (4.39 for the sample).]

Real BPE tokenizers (GPT, Llama, Mistral) learn their splits from huge corpora — this toy peels common prefixes/suffixes so you can see the idea: chop big words into reusable pieces, drop them into a fixed vocab, never see an OOV again.

Try this: type "strawberry" and count the token boundaries. You see the bug before we even finish explaining it.

And here's a from-scratch BPE trainer in <25 lines. Run it. Watch merges happen.

bpe.py
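A minimal sketch of such a trainer, following the five steps listed earlier. The four-word corpus with its counts and the `</w>` end-of-word marker are illustrative assumptions in the style of the classic BPE toy example, and `train_bpe` is a hypothetical name, not a library function.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    # Start from characters; "</w>" marks the end of each word.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in vocab.items():
            for pair in zip(syms, syms[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair (first one wins ties)
        merges.append(best)
        # Rewrite every word with that pair fused into one symbol.
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if syms[i:i + 2] == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + freq
        vocab = new_vocab
    return merges, vocab

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = train_bpe(corpus, num_merges=10)
for a, b in merges:
    print(f"{a} + {b} -> {a + b}")
```

Production tokenizers add byte-level fallback, regex pre-splitting, and much faster data structures, but the counting-and-merging core is the same idea.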

Run that and watch the algorithm discover common pieces like est, er, low all by itself, just by counting. It's really underwhelming once you see it. Which is the point. There's no magic. The model is eating a quietly clever compression scheme.

Step 2: Embeddings — Tokens Get a Personality

Integer IDs are a great start and a useless one — a neural net can't multiply a token's name. It needs vectors. Cue the embedding matrix: a giant lookup table of shape (vocab_size × d_model) where every token gets a personality.

Token ID 7842 just means "grab row 7842 from this matrix." That row is a d_model-dimensional vector — typically 768, 4096, or 12288 floats. That vector starts random, and through training it becomes the model's "internal representation" of that token. Similar tokens land near each other. ("Random" is doing a lot of heavy lifting there — at the start of training the model genuinely thinks "king" and "mayonnaise" are basically neighbors. Training fixes this. Slowly.)

The wild part: this space has geometry. The classic party trick is king − man + woman ≈ queen — the vector gap between "king" and "man" literally encodes a direction that means "royalty," and you can do arithmetic with concepts. Nobody designed this. The model stumbled into building a meaning-shaped coordinate system because it was the laziest way to predict the next word.

Quick déjà vu: remember K-Means from Chapter 9, where we found cliques by measuring how close points sat in space? Embeddings are that idea turned inside out. Instead of clustering pre-made vectors, the model learns the vectors so related tokens drift into the same neighborhood. Chapter 9 measured distance. Here, distance is the meaning.

Real embedding-matrix shapes (vocab_size × d_model):

GPT-2 small: 50,257 × 768
Llama 3 8B: 128,256 × 4096
GPT-3 175B: 50,257 × 12,288

Mathematically this is equivalent to a one-hot vector times E, but no sane person actually does that. You just index: E[token_ids].
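A quick check of that equivalence, as a sketch with toy sizes (the 10 × 4 matrix and the token IDs are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4                 # toy sizes, real ones are far bigger
E = rng.normal(size=(vocab_size, d_model))  # embedding matrix, random at init

token_ids = np.array([7, 2, 7])

# What everyone actually does: grab rows by index.
by_index = E[token_ids]                     # shape (3, 4)

# What the math says: one-hot rows times E.
one_hot = np.eye(vocab_size)[token_ids]     # shape (3, 10)
by_matmul = one_hot @ E                     # shape (3, 4)

print(np.allclose(by_index, by_matmul))
```

The indexing version skips a matmul against a matrix that is almost entirely zeros, which is why nobody bothers with the one-hot form in practice.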

Money-saving hack: the same matrix E often gets reused — flipped around — as the final layer that turns vectors back into vocab logits. It's called weight tying, and it makes sense: if two tokens mean similar things going in, they should look similar coming out. One matrix, two jobs, millions of parameters saved. The model is, once again, a lazy genius.

Positional Encoding — Order Matters, Bro

Here's a problem: self-attention is permutation-equivariant — a ten-dollar phrase for a simple, slightly horrifying fact. Shuffle the input tokens and the output shuffles identically. To the model, "dog bites man" and "man bites dog" are the same bag of vectors.

That's catastrophic for language. We have to tell the model about order. The fix: add a position-dependent vector to each token embedding before it enters the first block. It sounds too cheap to work — but if every position gets its own distinct fingerprint vector, the model can learn to read that fingerprint and figure out "ah, this token is the 5th one." Order, smuggled in through addition.

Token embeddings + positions → the input to block 1.

There have been four good ideas about how to do this:

Sinusoidal (2017): a fixed sine/cosine pattern of geometrically-spaced frequencies. Low dims tick fast, high dims tick slow — like a multi-resolution clock. Intuitive, never needs training.
Learned (BERT, GPT-2): another (max_seq_len × d_model) matrix, trained from scratch. Flexible but hard-capped at the training length — GPT-2 dies past 1024 tokens.
RoPE (2021): the modern winner. Instead of adding, you rotate the Q and K vectors inside attention by an angle that depends on position. The math conveniently makes attention scores depend only on relative distance. Used by Llama, Mistral, Qwen, DeepSeek, basically everyone good. Vanilla RoPE doesn't magically extrapolate past its training length — but unlike learned positions it's hackable: because it only cares about the angle between positions, you can quietly squish those angles closer at inference (NTK scaling, YaRN) and trick a model trained on 4k tokens into swallowing 128k — like fitting a longer movie on the same reel by playing it slightly slower.
ALiBi (2022): skip positions entirely. Add a linear penalty -m·|i-j| to attention scores. Simple, extrapolates great. Used by BLOOM and MPT.
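The "depends only on relative distance" claim for RoPE is easy to check numerically. This is a minimal sketch, assuming the interleaved pairing of dimensions (some implementations pair the first half with the second half instead) and the usual base of 10000; `rope` is a hypothetical helper that rotates a single vector.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of dims of a 1-D vector by pos * theta_i."""
    d = x.shape[-1]                              # assumed even
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin              # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_near = rope(q, 5) @ rope(k, 2)       # query at 5, key at 2
score_far = rope(q, 105) @ rope(k, 102)    # same gap of 3, shifted by 100

print(np.isclose(score_near, score_far))
```

Both scores match because rotating q by position m and k by position n leaves a dot product that only sees the difference m − n.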

The original 2017 sinusoidal formula:

PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right), \quad PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)

Here it is in <15 lines of numpy:

positional_encoding.py
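A minimal NumPy sketch of the sinusoidal formula above. The function name `sinusoidal_pe` is hypothetical, and the code assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model, base=10000.0):
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # the even indices 2i
    angles = pos / base ** (two_i / d_model)      # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims get sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims get cosine
    return pe

pe = sinusoidal_pe(max_len=50, d_model=16)
print(pe.shape)          # one d_model-sized fingerprint per position
print(pe[:3, :4].round(3))
```

Adding `pe[:n]` to an `(n, d_model)` array of token embeddings gives the input to the first block.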

Step 3: Attention — Everyone Listens to Everyone

This is the part that won the Turing Award (well, will). It's the heart of the whole machine.

Fun fact that should make you respect the chaos of science: Attention Is All You Need almost wasn't a thing. The eight authors hammered it together in a frantic sprint, some still tweaking experiments days before the deadline, and exactly none of them predicted it would eat the entire field. Within a few years every single one had left Google, most of them to start their own companies. The most important AI paper of the decade was basically a side quest.

The pre-Transformer world used RNNs — networks that processed tokens one at a time and compressed all of history into a single hidden state. By the time you reached token 200, token 3 was a faint smell. Gradients vanished. Memory was a sieve.

Attention's pitch: stop compressing. Let every token directly look at every other token in the context and pull what it needs. No bottleneck. No forgetting. Just a soft database lookup over the entire sequence.

The trick uses three vectors per token, made by three learned matrices:

Query (Q): what I am looking for. ("Tall, likes dogs.")
Key (K): what I am advertising. ("I am tall, I like dogs.")
Value (V): what you actually get if matched. (the actual person.)

Yes, attention is essentially a dating app for tokens. A token's Q gets dot-producted against every other token's K to score compatibility. Softmax turns scores into probabilities. Then we mix the V vectors weighted by those probabilities. The output is what this token "found" in the context.

The Formula

Given input X of shape (n × d_model), compute:

Q = X W_{Q} \in \mathbb{R}^{n \times d_{k}}, \quad K = X W_{K} \in \mathbb{R}^{n \times d_{k}}, \quad V = X W_{V} \in \mathbb{R}^{n \times d_{v}}

(V technically gets its own width d_v, but almost everyone sets d_v = d_k and so will we.)

Then the actual attention operation:

Attention(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V

Scaled dot-product self-attention, end to end.

Why divide by √d_k?

That word "softmax" should be ringing a bell — it's the exact same function that turned raw scores into a tidy probability distribution back in logistic regression and multi-class classification in Chapter 5. Zero new math. The only difference is what we're scoring.

Now, if Q and K entries behave like independent zero-mean, unit-variance noise (which they roughly do at init), the dot product q·k has variance d_k: each of the d_k terms is its own little unit-variance number, and variances of independent things add up. So the typical score magnitude (the standard deviation) grows like √d_k.

Big scores push softmax into a regime where one entry hogs ~all the probability and the gradient signal to everything else basically flatlines. (Softmax, given the chance, will always pick a favorite child and ghost the rest.) If "gradients flatline" gave you flashbacks, good. That's the vanishing gradient ghost from Chapter 10, the same reason we threw sigmoid out of hidden layers. Dividing by √d_k normalizes the variance back to 1 and keeps softmax in its gradient-friendly zone. Boring math. Crucial fix.
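You can watch the variance argument happen with a few lines of NumPy. This is a sketch under the same assumption as the text: q and k entries drawn as independent standard normals. The sample count of 10,000 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for d_k in (16, 64, 256):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    raw = (q * k).sum(axis=1)           # 10,000 sample dot products
    scaled = raw / np.sqrt(d_k)         # the attention fix
    results[d_k] = (raw.var(), scaled.var())
    print(f"d_k={d_k:4d}  var(raw)={raw.var():7.1f}  var(scaled)={scaled.var():.2f}")
```

The raw variance should land near d_k each time, while the scaled variance should hover near 1 regardless of head width.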

Causal Masking — No Peeking

For a decoder-style LM (GPT, Llama, Claude), token i must not be allowed to attend to tokens j > i. Otherwise during training the model just cheats by peeking at the next word and we've trained an extremely expensive identity function.

Fix: before softmax, add a mask matrix with -∞ everywhere above the diagonal. After softmax those entries get a weight of exactly 0, because exp(-∞) = 0. The future is dark. Encoder models like BERT skip this mask: they're allowed to look both ways.

Practical kicker: because every token only attends to what came before it, the order you write a prompt matters. Put the instruction before the data, not after. "Summarize this: [10 pages]" beats "[10 pages] — summarize that," because in the second one the model slogged through ten pages with no idea what the assignment was.

Complexity

The QKᵀ matmul is O(n² · d_k) time and O(n²) memory — and that O(...) is just CS shorthand for "how fast the cost grows": O(n²) means double the input, quadruple the pain. So double the context and you quadruple the work. This is why 1M-token context windows are a hardware sport, not a software trick — and why a pile of clever fixes exist. We'll meet them next.

Practical kicker: this same O(n²) is why a giant prompt costs more than its token count suggests. Stuffing your entire codebase into one prompt "just in case" isn't free — it's slow and pricey. Send what's relevant, not what's available.

Here's the whole thing from scratch in numpy. Read it carefully — every modern LLM does exactly this, just bigger:

attention.py
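A minimal single-head sketch of scaled dot-product self-attention with a causal mask, in NumPy. The function names and toy shapes are illustrative assumptions, and the code uses the row-vector convention (`X @ W`) from the formula section.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # (n, d_k), (n, d_k), (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) compatibility scores
    n = X.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores) # token i can't see j > i
    weights = softmax(scores)                  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4                      # toy sizes
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = causal_self_attention(X, W_q, W_k, W_v)
print(weights.round(2))                        # lower-triangular heatmap
```

Row 0 of `weights` is always `[1, 0, 0, ...]`: the first token has nobody to listen to but itself.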

And here's a tiny interactive so you can see attention flow on a real sentence:

[Interactive demo: self-attention heatmap (decoder-style, causal mask), with a softmax temperature slider (default 0.40).]

Each row is one token asking, "who should I listen to?" The orange cells are where attention flows. The upper-right triangle is dark because we mask future tokens — decoders can't peek ahead. Crank temperature ↑ and the model gets "confused" (uniform); push it ↓ and it locks onto a single neighbour.

Try this: feed it "she gave him a book" — watch which words "him" leans on.

A Common Misconception

Attention weights are not interpretability oracles. A big paper from 2019 (Jain & Wallace, "Attention is not Explanation") showed that you can often shuffle a model's attention weights, or swap in a very different adversarial set, and its prediction barely moves. Multiple heads, residual streams, and MLPs scramble the "who looked at who" story beyond easy reading.

The story has a sequel, though. A 2019 rebuttal cheekily titled "Attention is not not Explanation" (Wiegreffe & Pinter) pointed out the permuted-weights trick is a bit of a magic act — those adversarial permutations are hard to actually find, and under stricter rules attention can carry real signal. Verdict after all the academic slap-fighting: attention weights are a clue, not a confession. Use them, don't worship them.

One genuinely spooky discovery: some heads grow into induction heads all on their own. Show the model "…Harry Potter… Harry" and the head goes "ah, last time I saw Harry the next token was Potter — copying that." Nobody programmed it. Anthropic's interpretability crew (Olsson et al., 2022) argued it's a big chunk of where in-context learning actually comes from. Pattern-matching, all the way down.

And the Backward Pass

We gave the forward pass a whole loving section. Here's the half nobody draws: gradients flowing backward through attention. The loss tugs on the output, which splits three ways: back through V (easy, it's just a weighted sum), and back through the softmax into the QKᵀ scores, which fork again into Q and K. That softmax Jacobian is the spicy bit: each score's gradient depends on every other score in its row. So one wrong word at position 200 sends correction signals fanning out to every token it attended to, all at once, in one matmul. RNNs had to whisper that gradient down a 200-step chain. Attention just mails it directly. That's the whole revolution in one sentence.
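The softmax Jacobian for one row of scores has a closed form, J[i, j] = p_i (δ_ij − p_j), and a finite-difference check confirms it. This is a sketch on a made-up row of four scores; the helper names are hypothetical.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def softmax_jacobian(s):
    p = softmax(s)
    return np.diag(p) - np.outer(p, p)     # J[i, j] = p_i * (delta_ij - p_j)

s = np.array([2.0, 1.0, 0.1, -1.0])        # one row of attention scores
J = softmax_jacobian(s)

# Numerical check with central differences, one input score at a time.
eps = 1e-6
J_num = np.zeros((4, 4))
for j in range(4):
    step = np.zeros(4)
    step[j] = eps
    J_num[:, j] = (softmax(s + step) - softmax(s - step)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-7))
print(J.round(3))
```

No entry of `J` is zero, which is the "every score's gradient depends on every other score in its row" point in matrix form.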

Step 4: Multi-Head Attention — A Committee of Specialists

One attention head is forced to do one kind of looking. But "the cat that the dog chased" needs syntactic attention (subject-verb), coreference attention (pronoun-antecedent), and positional attention all at once.

So we run several attention operations in parallel — call them heads — each with their own Q, K, V projections, each looking at a different subspace. Then we concatenate and project back. Same parameter budget, sliced into parallel views.

Multi-head attention: parallel committee, then merge.

Concrete head counts:

GPT-3 small: 12 heads × 64 dim
GPT-3 175B: 96 heads × 128 dim
Llama 3 70B: 64 query heads
PaLM 540B: 48 heads

Probing studies (Clark et al., 2019) found that BERT heads do specialize — some track direct objects, others determiners, others coreference. Nobody asked them to. They just did.

MQA, GQA — Saving the KV Cache

At inference time, the bottleneck isn't compute. It's reading the KV cache — the stash of every past token's K and V vectors the model keeps around so it doesn't recompute them every step (full story in Step 9) — from GPU memory. Multi-head attention stores K and V per head, which is a lot of memory.

MQA (Multi-Query Attention): keep N query heads but share one K and one V across all of them. KV cache shrinks by N×. Quality dips slightly. Used in PaLM.
GQA (Grouped-Query Attention): group the query heads carpool-style and share one K/V per group. Llama 2 70B runs 64 Q heads but only 8 KV heads — eight cars, sixty-four passengers. Llama 3, Mistral, Mixtral all use GQA. You get near-MHA quality with near-MQA memory.
The punchline: GQA isn't some clever new math — it's the lazy-genius move of noticing you have 64 query heads but only need about 8 opinions on what to look at. Llama 3 ships every size with it, even the 8B. Free memory, basically no quality tax.
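Here is the carpool arithmetic as a sketch. The 64 query heads and 8 KV heads come from the text; the 80 layers, head dimension of 128, and 2-byte (fp16/bf16) cache entries are assumptions about a Llama-2-70B-style configuration, and `kv_cache_bytes_per_token` is a hypothetical helper.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2 because we store both K and V for every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
mqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=1, head_dim=128)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")

print(f"GQA saves {mha // gqa}x over MHA")
```

Under these assumptions that is 2,560 KiB per token for full multi-head, 320 KiB with 8 KV heads, and 40 KiB with a single shared one. Multiply by a long context and a batch of users and the difference decides how many requests fit on a GPU.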

FlashAttention — Not Math, Plumbing

Tri Dao's 2022 paper noticed: attention isn't slow because of FLOPs, it's slow because you keep shuttling the n×n attention matrix between slow HBM (high-bandwidth memory) and fast SRAM on the GPU. FlashAttention tiles Q, K, V into SRAM-sized blocks and computes softmax incrementally with an "online softmax" trick. The n×n matrix is never materialized. Same math. Same gradients. 2–4× faster. Sub-quadratic memory.

FlashAttention-1 (2022), -2 (2023), -3 (2024 on Hopper async + FP8). It is now what every serious framework uses under the hood. You probably never write attention by hand in 2026.

The wild part: FlashAttention is not an approximation. It computes exactly the same attention and the same gradients, give or take floating-point rounding from doing the sums in a different order. It just stopped being dumb about which memory it talks to. (In the backward pass it even recomputes the attention blocks instead of storing them, which is the actual source of the "sub-quadratic memory" win.) Tri Dao basically looked at the GPU memory hierarchy the way you'd look at a badly-organized fridge and said "this, but staged correctly." That single plumbing fix is a big part of why million-token context windows exist at all.
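The "online softmax" trick can be sketched in plain NumPy. To be clear about the assumptions: this is not FlashAttention itself (no GPU kernel, no SRAM tiling, only K and V are blocked, no causal mask), just the running-max and running-sum bookkeeping that lets you process keys one block at a time without ever holding the full n×n matrix. Function names and the block size are made up.

```python
import numpy as np

def full_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V

def blockwise_attention(Q, K, V, block=4):
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))        # unnormalized running output
    row_max = np.full(n, -np.inf)          # running max score per query
    denom = np.zeros(n)                    # running softmax denominator
    for start in range(0, K.shape[0], block):
        K_b, V_b = K[start:start + block], V[start:start + block]
        s = Q @ K_b.T / np.sqrt(d)                 # only an (n, block) slice
        new_max = np.maximum(row_max, s.max(axis=1))
        rescale = np.exp(row_max - new_max)        # shrink old accumulators
        p = np.exp(s - new_max[:, None])
        denom = denom * rescale + p.sum(axis=1)
        out = out * rescale[:, None] + p @ V_b
        row_max = new_max
    return out / denom[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
ref = full_attention(Q, K, V)
tiled = blockwise_attention(Q, K, V, block=4)
print(np.allclose(ref, tiled))
```

The two results agree to floating-point tolerance, which is the sense in which the tiled version is "the same math."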

multi_head_attention.py
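A minimal NumPy sketch of causal multi-head attention, with an optional `n_kv_heads` argument so the same function also covers GQA and MQA. Names, shapes, and the random weights are illustrative assumptions; there are no positional rotations or KV cache here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads, n_kv_heads=None):
    n, d_model = X.shape
    n_kv_heads = n_kv_heads or n_heads         # default: plain MHA
    d_head = d_model // n_heads
    group = n_heads // n_kv_heads              # query heads per K/V head

    # Project, then split the last dim into heads: (heads, n, d_head).
    Q = (X @ W_q).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, n_kv_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, n_kv_heads, d_head).transpose(1, 0, 2)

    # GQA: every K/V head is shared by `group` consecutive query heads.
    K = np.repeat(K, group, axis=0)
    V = np.repeat(V, group, axis=0)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)             # causal mask
    per_head = softmax(scores) @ V                         # (heads, n, d_head)

    concat = per_head.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o                                    # merge the committee

rng = np.random.default_rng(0)
n, d_model, n_heads, n_kv_heads = 5, 16, 4, 2
d_head = d_model // n_heads
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, n_kv_heads * d_head))      # smaller under GQA
W_v = rng.normal(size=(d_model, n_kv_heads * d_head))
W_o = rng.normal(size=(d_model, d_model))

Y = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads, n_kv_heads)
print(Y.shape)
```

Setting `n_kv_heads=1` gives MQA, and leaving it unset gives classic multi-head attention with full-size K and V projections.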

Step 5: The Transformer Block — Lego Brick of Genius

One attention layer alone is dumb. Stack a bunch of them and add a few accessories and you get the actual repeating unit of a Transformer: the block.

The cleanest way to think about a block (and Anthropic's interpretability team popularized this) is as a residual stream — a shared vertical bus running through the model (think a shared whiteboard every layer can scribble on, never erasing — just adding notes). Every block reads from it, computes something, and writes back additively. Nothing overwrites — it's always x = x + something. The residual stream is the communication channel. Delete the residual connections and the model forgets how to talk to itself.

One block: two taps on the residual stream.

The modern (Llama-style) block does two taps on this bus:

Normalize a copy of the current state.
Run multi-head attention on it.
Add the result back to the bus.
Normalize a copy again.
Run a feed-forward network (FFN) on it.
Add the result back to the bus.

Pre-Norm vs Post-Norm

The original 2017 paper used post-norm: x = LayerNorm(x + Sublayer(x)). The norm sits on the main path, after the residual add, so the stream and its gradients get re-normed at every layer. Hard to train deep without warmup and careful init.

Nearly all modern models use pre-norm: x = x + Sublayer(LayerNorm(x)). The norm sits inside the branch. The residual is a clean highway. Gradients flow back untouched. You can train 70B+ models without babysitting the learning rate.

LayerNorm vs RMSNorm

LayerNorm subtracts the mean, divides by std, applies learned scale and bias. RMSNorm (2019) drops the mean centering:

RMSNorm(x) = \frac{x}{\sqrt{\frac{1}{d} \sum_{i} x_{i}^{2} + \epsilon}} \odot \gamma

Fewer operations, slightly faster, no bias. Empirically just as good. Llama, Mistral, Gemma, Qwen all use RMSNorm. It's one of those quiet upgrades that aged great — and it works because the mean-subtraction in LayerNorm was mostly doing nothing. Re-centering vectors turned out to be the gym membership of deep learning: everyone paid for it, almost nobody needed it.

The FFN — Where Most of the Parameters Actually Live

Everyone obsesses over attention. But roughly two-thirds of a typical dense block's parameters live in the feed-forward network. The classic FFN is two linear layers with an activation between them:

FFN (x) = W_{2} \cdot GELU (W_{1} x)

The intermediate dimension is typically 4× the hidden dim — wide enough to mix dimensions and act as a giant key-value memory, narrow enough to fit the compute budget.

And "key-value memory" is not a metaphor. It's the mechanism, and it's gorgeous. The first matrix W₁ is a stack of detector neurons; each row is a pattern, and W₁x lights up the ones that match the current token's vibe ("this is about Paris," "this smells like Python code"). The activation gates them. Then W₂ is a stack of writer vectors: each fired neuron dumps its associated content back into the residual stream. Keys in W₁, values in W₂, addressed by content instead of index. Geva et al. (2021) went full CSI on this, and follow-up model-editing work (ROME, Meng et al., 2022) showed you can rewrite a single fact with a small, targeted edit to one FFN matrix. The model's "knowledge of France" has a street address.

SwiGLU — The Gated Upgrade

Modern models swap GELU for SwiGLU (Shazeer 2020):

SwiGLU (x) = W_{d} \cdot (SiLU (W_{g} x) ⊙ (W_{u} x))

It's a gated activation — one branch decides how much to let through, the other decides what to pass. Three matrices instead of two, so to keep the param count honest Llama uses the 2/3 trick: intermediate ≈ (2/3) · 4 · d_model, rounded to a friendly multiple of 256.

Llama 3 8B concrete numbers: hidden_size = 4096, intermediate_size = 14336. Roughly the (2/3) · 4 rule — Llama actually sneaks in a 1.3× fudge factor, then snaps it to a multiple of 256, because GPUs love round numbers like cats love boxes. SwiGLU buys a small but consistent perplexity win over plain GELU at equal compute.
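The sizing rule can be written out as a small function. This is a sketch that mirrors the rounding logic as I understand it from Meta's public Llama reference code, so treat the exact order of operations as an assumption; `llama_ffn_dim` is a hypothetical name.

```python
def llama_ffn_dim(d_model, ffn_dim_multiplier=None, multiple_of=256):
    hidden = 4 * d_model                    # the classic 4x width
    hidden = int(2 * hidden / 3)            # the 2/3 trick for three matrices
    if ffn_dim_multiplier is not None:
        hidden = int(ffn_dim_multiplier * hidden)   # optional fudge factor
    # Round up to the next multiple of `multiple_of`.
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(llama_ffn_dim(4096))                           # plain 2/3 rule
print(llama_ffn_dim(4096, ffn_dim_multiplier=1.3))   # with the 1.3x factor
```

With d_model = 4096 the plain rule gives 11008, and adding the 1.3× factor gives 14336, the intermediate_size quoted above for Llama 3 8B.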

transformer_block.py
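A minimal NumPy sketch of one pre-norm, Llama-style block: RMSNorm, causal multi-head attention, and a SwiGLU FFN, each added back onto the residual stream. Everything here (names, toy sizes, the 0.02 init scale) is an illustrative assumption, and RoPE is left out to keep the block short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma                    # no mean subtraction, no bias

def silu(x):
    return x / (1.0 + np.exp(-x))             # x * sigmoid(x)

def causal_attention(x, p, n_heads):
    n, d = x.shape
    d_head = d // n_heads

    def split(t):                             # (n, d) -> (heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(x @ p["W_q"]), split(x @ p["W_k"]), split(x @ p["W_v"])
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = (softmax(scores) @ V).transpose(1, 0, 2).reshape(n, d)
    return out @ p["W_o"]

def swiglu_ffn(x, p):
    gate = silu(x @ p["W_g"])                 # how much to let through
    up = x @ p["W_u"]                         # what to pass
    return (gate * up) @ p["W_d"]

def transformer_block(x, p, n_heads):
    x = x + causal_attention(rms_norm(x, p["g_attn"]), p, n_heads)  # tap 1
    x = x + swiglu_ffn(rms_norm(x, p["g_ffn"]), p)                  # tap 2
    return x

def init_params(d_model, d_ff, rng):
    def w(*shape):
        return rng.normal(scale=0.02, size=shape)
    return {
        "g_attn": np.ones(d_model), "g_ffn": np.ones(d_model),
        "W_q": w(d_model, d_model), "W_k": w(d_model, d_model),
        "W_v": w(d_model, d_model), "W_o": w(d_model, d_model),
        "W_g": w(d_model, d_ff), "W_u": w(d_model, d_ff), "W_d": w(d_ff, d_model),
    }

rng = np.random.default_rng(0)
params = init_params(d_model=16, d_ff=48, rng=rng)
x = rng.normal(size=(6, 16))                  # 6 tokens on the residual stream
y = transformer_block(x, params, n_heads=4)
print(y.shape)
```

The output has the same shape as the input, which is exactly what lets you stack the block as many times as your budget allows.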

Step 6: Stack 'Em Tall — The Whole Transformer

One block is a smart Lego brick. Now we do the only thing anyone has ever done with Lego: stack it until it's taller than reason. The full model is just: tokens → embed → add positions → block → block → block → … (N times) → final norm → unembed → softmax over vocab.

The whole show: embed → N blocks → unembed.

That's the whole machine. The depth budget is mostly "how many blocks can you afford to train."

GPT-2 small: N = 12
GPT-2 XL: N = 48
Llama 3 8B: N = 32
Llama 3 70B: N = 80
Llama 3.1 405B: N = 126

Three Architectures, One Winner

Historically there were three transformer flavors, all from the same family of Lego bricks but stacked differently:

Same Lego bricks. Three instruction manuals. One ate the planet.

Encoder-only (BERT, 2018): bidirectional attention, masked language modeling, output is a tower of contextual vectors that you fine-tune with a head for classification, retrieval, etc. Still rules where you need understanding: embeddings (SBERT, ColBERT), rerankers, toxicity classifiers. Every vector DB you've ever used has a BERT-ish model behind it. (Fun aside: BERT's siblings are an actual Muppet cinematic universe — ELMo came first, then BERT, then ERNIE, then Big Bird. NLP researchers in 2018 collectively decided the Sesame Street bit could not wait.)
Encoder-decoder (T5, BART): the shape the original 2017 transformer used. Encoder reads, decoder generates with cross-attention to the encoder's output. Still useful for translation, summarization. Also: image diffusion models often pair a text encoder (CLIP or T5) with a denoising decoder.
Decoder-only (GPT, Llama, Claude, basically everything famous): just a causal LM. One objective: next-token prediction. No special heads, no MLM trickery, no cross-attention. Decoder-only is next-token prediction wearing no costume.

Decoder-only won because:

Unified objective. Translation, summarization, code, math, chat — all just "continue this string."
Scalability. One tensor flow, no encoder/decoder imbalance, dead simple to shard.
Generation-native. Sampling falls out for free.

T5 framed everything as text-to-text in 2019; GPT-3 made it dogma in 2020 by proving scale + next-token is enough for in-context learning. Show the model a few examples in the prompt and it learns the task. No fine-tuning. Just vibes.

"Just vibes" is a cop-out, so here's the real story. The model was never trained to learn from examples — but predicting the next token across the whole internet forces it to. Tons of training text looks like "pattern, pattern, pattern, → ?", so the model gets very good at one meta-skill: spot the pattern in the prompt and continue it (those induction heads from Step 3 doing the heavy lifting). Your few-shot examples aren't training the model — no weights change. They're configuring a pattern-matcher that already exists. The "learning" happens entirely inside one forward pass, in the activations.

Step 7: Pretraining — From Random Noise to Shakespeare

An untrained Transformer is a cathedral of random numbers. Beautiful, useless, and about to eat the GDP of a small country. Getting it from random weights to coherent English is the most absurdly compute-hungry leg of the journey.

The objective is dead simple. Given all the tokens before position t, predict the token x_t.

L = -\frac{1}{N} \sum_{t=1}^{N} \log P_{\theta}(x_{t} \mid x_{<t})

Look closely at that loss — it's cross-entropy, the exact same loss our tiny XOR net minimized in Chapter 10 and our classifiers used in Chapter 5. Nothing fancy got invented for GPT. We took the loss you already coded by hand, pointed it at "predict the next token," and poured fifteen trillion tokens through it. No labels, no humans — the text is the label, just shifted by one. Supervision is free. Scale is the only new ingredient.
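The "text is the label, just shifted by one" idea fits in a few lines. This is a sketch assuming the model hands back a `(T, vocab)` array of logits where row t is its guess for what follows token t; `next_token_loss` is a hypothetical name.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t."""
    preds = logits[:-1]                  # the last position has no target
    targets = token_ids[1:]              # the same text, shifted by one
    z = preds - preds.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))  # log-softmax
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

vocab = 10
tokens = np.array([1, 2, 3, 4, 5])

# An untrained model that knows nothing: uniform logits.
clueless = next_token_loss(np.zeros((5, vocab)), tokens)

# A model that puts a huge logit on the right next token.
confident_logits = np.zeros((5, vocab))
confident_logits[np.arange(4), tokens[1:]] = 20.0
confident = next_token_loss(confident_logits, tokens)

print(clueless, np.log(vocab))           # uniform guessing costs log(vocab)
print(confident)
```

A fresh model starts near log(vocab_size) and pretraining is the long walk from there toward zero.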

Pretraining: shift-by-one, repeat a trillion times.

The Corpora (Where the Tokens Come From)

Common Crawl: petabytes of raw web scrape, ~95% garbage.
RefinedWeb (Falcon, 2023): ~5T filtered tokens.
FineWeb (HuggingFace, 2024): 15T tokens, heavy dedup — its FineWeb-Edu subset (~1.3T) keeps only the "would a teacher approve of this?" pages.
The Pile (EleutherAI, 2020): 825GB, 22 sources (ArXiv, GitHub, PubMed, Books3).
Code mixes: The Stack v2 (~900B tokens), StarCoder data. Code boosts reasoning even on non-code benchmarks. Yes, training on code makes the model better at math.

Scale: Numbers That Should Scare You

Model	Params	Training Tokens
GPT-3	175B	300B
Llama 2 70B	70B	2T
Llama 3 8B / 70B	8B / 70B	15T
Llama 3.1 405B	405B	15.6T

Llama 3.1 405B reportedly used ~3.8 × 10²⁵ FLOPs across 16,000 H100s for ~54 days. An H100 is about $30k. Training is a one-time bill; inference is rent — which is exactly why the industry now over-trains small models, but more on that shortly.
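That FLOP figure can be sanity-checked with a common back-of-the-envelope rule: training costs roughly 6 FLOPs per parameter per token. The rule is an outside approximation, not something this chapter derives, and it ignores attention's extra cost, so treat the sketch as a rough estimate only.

```python
def training_flops(n_params, n_tokens):
    # ~2 FLOPs per param per token forward, ~4 backward (rule of thumb).
    return 6 * n_params * n_tokens

llama_405b = training_flops(n_params=405e9, n_tokens=15.6e12)
gpt3 = training_flops(n_params=175e9, n_tokens=300e9)

print(f"Llama 3.1 405B: {llama_405b:.2e} FLOPs")
print(f"GPT-3:          {gpt3:.2e} FLOPs")
print(f"ratio:          {llama_405b / gpt3:.0f}x")
```

The estimate for Llama 3.1 405B comes out at about 3.79 × 10²⁵, within a percent of the reported number.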

Let that 16,000-GPU number sink in, because those GPUs don't just compute — they fail. At that scale a frontier run hits a hardware fault every few hours: a GPU dies, a network link flakes, a node falls over. Meta logged hundreds of interruptions across the Llama 3 run. The training loop you'll see below has a checkpoint step in it for a reason — without it, one cosmic ray sets $50M on fire.

Funny in hindsight: when OpenAI built GPT-2 in 2019, they initially declined to release the full model, calling it "too dangerous," and staged it out over months starting with a tiny 124M version. GPT-2 is now something you run on a laptop for fun. Capability moves so fast that yesterday's apocalypse is today's tutorial project.

The Optimizer

The optimizer is AdamW, the same trusty workhorse from earlier chapters, just with the dials cranked to "industrial." Decoupled weight decay ~0.1, β₁ = 0.9, β₂ = 0.95, ε = 1e-8. The learning rate doesn't just start hot: you ease it in with ~2000 steps of linear warmup (so the model doesn't faceplant on step one), then cosine-decay it down to ~10% of peak, like slowly lowering the heat on a stove. Peak LR shrinks as models grow: around 6e-4 for ~100M-parameter models, ~3e-4 around 7B, and roughly 1e-4 or lower for the biggest runs. Nobody derived these numbers from first principles, by the way. They were found the way most deep learning constants are found: someone tried a bunch and the loss curve looked happy.
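The warmup-then-cosine schedule is easier to trust once you can poke it. A minimal sketch, assuming a hypothetical run of 100,000 steps; `lr_at` and its default values are illustrative, not taken from any particular training recipe.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                        # hold the floor after the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

for step in (0, 1_999, 51_000, 100_000):
    print(step, f"{lr_at(step):.2e}")
```

Step 0 is a whisper, step 1,999 hits the peak, the midpoint of the decay sits at 55% of peak, and the last step lands on the 10% floor.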

And underneath all of it: weight = weight − learning_rate × gradient. That's gradient descent from Chapter 3 (the lazy-genius algorithm) wearing a very expensive suit. AdamW just adds momentum and gives each weight its own adaptive step size. Same hill, same downhill walk, just 16,000 GPUs taking it together. Global batches are massive: Llama 3 used ~16M tokens per batch, built by splitting each batch across thousands of those GPUs (data parallelism) and summing their gradients, with gradient accumulation over micro-batches where needed.

Parallelism — The Real Engineering

A 70B fp32 model needs ~280GB just for weights. An H100 has 80GB. So you split everything:

Data parallel (DP): same model, different data shards.
Tensor parallel (TP): split a single matmul across GPUs (Megatron-LM).
Pipeline parallel (PP): layers 1–10 on GPU A, 11–20 on B, microbatches keep the assembly line full.
ZeRO / FSDP: shard optimizer states, gradients, and weights across the DP ranks.
3D / 4D parallelism: stack DP × TP × PP simultaneously. Llama 3 added context parallelism for long sequences. Four axes of pain.

Mixed precision: bfloat16 in the forward/backward (wider exponent than fp16, no loss scaling needed), fp32 master weights and optimizer states.
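Here is the memory arithmetic behind "so you split everything", as a rough sketch. The 16-bytes-per-parameter figure is a common accounting for mixed-precision Adam training and is an assumption here; activations are ignored entirely, so real numbers are worse.

```python
import math

GB = 1e9
H100_MEMORY_GB = 80

def state_gb(params, bytes_per_param):
    return params * bytes_per_param / GB

params = 70e9
fp32_weights = state_gb(params, 4)   # the ~280 GB quoted in the text

# Assumed accounting per parameter for mixed-precision Adam training:
# bf16 weight (2) + bf16 gradient (2) + fp32 master weight (4)
# + fp32 Adam first moment (4) + fp32 Adam second moment (4)
bytes_per_param_training = 2 + 2 + 4 + 4 + 4
training_state = state_gb(params, bytes_per_param_training)
min_gpus = math.ceil(training_state / H100_MEMORY_GB)

print(f"fp32 weights alone: {fp32_weights:.0f} GB")
print(f"training state:     {training_state:.0f} GB")
print(f"H100s needed just to hold it: {min_gpus}")
```

Fourteen GPUs before a single activation is stored, which is why sharding schemes like ZeRO and FSDP exist.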

Scaling Laws — Chinchilla's Lesson

Kaplan et al. (2020) found loss is a power law of compute, params, and data. They suggested scaling params faster than data. Two years later DeepMind's Chinchilla (2022) said: actually no. For a fixed compute budget, parameters and tokens should grow in equal proportion, which works out to about 20 tokens per parameter. GPT-3 (1.7 tok/param) was wildly under-trained.

Modern reality: Llama 3 8B trained on 15T tokens ≈ 1875 tok/param, roughly 94× past the Chinchilla optimum. Why? Training cost is one-time; inference is forever. A smaller, over-trained model is cheaper to serve every single day it exists, and the whole industry now over-trains small models for exactly that reason.
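The tokens-per-parameter ratios are quick to recompute from the table earlier in this step. A small sketch, taking the 20 tokens/param rule of thumb at face value:

```python
CHINCHILLA_TOKENS_PER_PARAM = 20

# (parameters, training tokens), copied from the scale table above
runs = {
    "GPT-3": (175e9, 300e9),
    "Llama 2 70B": (70e9, 2e12),
    "Llama 3 8B": (8e9, 15e12),
    "Llama 3.1 405B": (405e9, 15.6e12),
}

ratios = {name: tokens / params for name, (params, tokens) in runs.items()}
for name, ratio in ratios.items():
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name:16s} {ratio:8.1f} tok/param  ({multiple:.2f}x the rule of thumb)")
```

GPT-3 comes out far below 1x, the 405B model lands near 2x, and the 8B model is off the chart on purpose.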

Plot twist nobody mentions: Chinchilla itself had a math slip. A 2024 re-analysis (Besiroglu et al.) found the original paper's third scaling-law fit was off — the curves actually agree better than DeepMind realized. The "20 tokens per parameter" rule of thumb survived; the confidence intervals did not. Even the people doing the scaling laws fat-finger the spreadsheet.

Are Emergent Abilities Real?

Wei et al. (2022) claimed certain abilities (multi-step arithmetic, multi-hop reasoning) appear suddenly past a scale threshold — phase transitions. Then Schaeffer et al. (2023) argued these jumps are artifacts of discontinuous metrics (exact-match accuracy). Switch to token-level log-likelihood and the curves are smooth. Real capability gain — fake phase transition. The honest answer is somewhere in the middle and depends on what you're measuring.

Worth knowing how fast this can go wrong: in November 2022 Meta released Galactica, a model trained on 48 million scientific papers and demoed as a tool to "summarize science." It confidently generated authoritative-sounding fake research, complete with invented citations, and Meta pulled the public demo after three days. A model fluent in the shape of truth is not the same as a model that knows it.

training_loop.py
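A loop with that checkpoint step might look roughly like this. It is a toy sketch, not a real pretraining script: the one-weight model (learning y = 3x), the `ckpt.json` file, and the simulated crash are all invented for illustration.

```python
import json
import os
import random
import tempfile

def save_checkpoint(path, step, weight):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "weight": weight}, f)
    os.replace(tmp, path)  # atomic rename: a crash never leaves a half-written file

def load_checkpoint(path):
    if not os.path.exists(path):
        return 0, 0.0      # fresh start
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weight"]

def train(path, total_steps=200, lr=0.5, ckpt_every=10, crash_at=None):
    step, weight = load_checkpoint(path)   # resume if a checkpoint exists
    while step < total_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated hardware fault")
        # The batch is a pure function of the step, so a resumed run
        # sees exactly the data it would have seen without the crash.
        x = random.Random(step).uniform(-1.0, 1.0)
        y = 3.0 * x
        grad = 2.0 * (weight * x - y) * x  # d/dw of squared error
        weight -= lr * grad                # weight = weight - lr * gradient
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(path, step, weight)
    return weight

ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt_path, crash_at=25)          # dies 5 steps after the last save
except RuntimeError:
    pass
resumed_step, _ = load_checkpoint(ckpt_path)
final_weight = train(ckpt_path)            # picks up from the checkpoint
print(f"resumed from step {resumed_step}, final weight {final_weight:.4f}")
```

The crash costs five steps of work instead of the whole run. Scale "five steps" up to "a few hours on 16,000 GPUs" and the checkpoint interval becomes a budgeting decision.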

Step 8: Post-training — How a Base Model Becomes a Chat Model

Pretraining gives you a smart-but-feral model. Show a raw base model "What is the capital of France?" and it might continue with "What is the capital of Germany? What is the capital of Spain?" because the internet has lots of quiz lists. It doesn't know that it's supposed to be a helpful assistant. That's on us.

RLHF (top) vs DPO's shortcut (bottom). Same destination.

Stage 1: SFT (Supervised Fine-Tuning)

Curate (prompt, ideal_response) pairs and continue training with cross-entropy on the response tokens only — we don't want to train the model to generate the user's prompt, it didn't write that. Loss only on the tokens it's actually responsible for: the answer. Famous datasets:

Alpaca: 52k instructions, generated with OpenAI's text-davinci-003.
OpenAssistant: crowd-sourced multi-turn.
ShareGPT: scraped ChatGPT logs.
Dolly, FLAN, etc.

After SFT the model follows instructions but is often bland and confidently-wrong, and has no idea what humans prefer among many correct answers.

Stage 2: RLHF — The InstructGPT Recipe

OpenAI's 2022 three-stage pipeline that birthed ChatGPT:

SFT (above).
Reward Model (RM). Show humans two completions A and B, ask which is better. Then train a "scalar-head model" (our transformer with the vocab-sized output swapped for a single number, a quality score) with the Bradley-Terry loss (a classic stats recipe for "A beat B" data; the loss just nudges the winner's score above the loser's): $L_{RM} = -\log \sigma\big(r_{\phi}(x, y_w) - r_{\phi}(x, y_l)\big)$
PPO. Roll out completions from the policy, score with the RM, then maximize this objective (the RM reward minus a KL penalty back to the SFT model, so the policy doesn't drift into RM-hacking gibberish) with Proximal Policy Optimization: $J_{RLHF} = \mathbb{E}\big[r_{\phi}(x, y) - \beta \cdot \mathrm{KL}(\pi_{\theta} \,\|\, \pi_{SFT})\big]$
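The reward-model loss from the second stage is small enough to compute by hand. A minimal sketch, where the scores stand in for whatever the scalar head would output for the chosen and rejected completions (the numbers are made up):

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(score_chosen, score_rejected):
    """Bradley-Terry loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -log_sigmoid(score_chosen - score_rejected)

tie = reward_model_loss(1.0, 1.0)      # RM can't tell them apart
correct = reward_model_loss(2.0, 0.0)  # winner scored higher: small loss
wrong = reward_model_loss(0.0, 2.0)    # loser scored higher: big loss
print(round(tie, 4), round(correct, 4), round(wrong, 4))
```

Only the gap between the two scores matters, which is why reward-model outputs have no meaningful absolute scale.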

It works, but PPO is a finicky beast — four models in memory (policy, reference, RM, value network), unstable, expensive, and tuning β feels like alchemy.

(Quick gloss, since the book hasn't leaned on it before: RL — reinforcement learning — just means let the model try stuff, reward the tries that worked, and it slowly learns the winning behavior. No labeled "correct answer," just a thumbs-up signal.)

And here's how reward hacking sneaks in, concretely. Human raters, skimming fast, tend to upvote answers that look thorough — long, bulleted, hedged, headers everywhere. The RM faithfully learns "longer + listier = better." Then PPO, doing its job perfectly, discovers it can farm reward by padding every answer into a LinkedIn post. Nobody asked for this. The model isn't lying — it found a real correlation in the RM and exploited it ruthlessly, because that is literally the objective. Sycophancy is the same bug in a nicer outfit: raters like being agreed with, so the model learns agreement is free reward. The KL leash is what stops it from going full word-salad teacher's-pet.

Stage 2' (the modern shortcut): DPO

DPO (Direct Preference Optimization, Rafailov et al. 2023) ate RLHF's lunch. The math says: if the reward is implicitly defined by the optimal policy under KL constraint, you can skip the RM and PPO entirely. Same (prompt, chosen, rejected) data. One closed-form loss. Plain supervised training. No rollouts. No value function. Stable, cheap, and competitive — though at frontier scale online RL still tends to edge it out.

dpo.py
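As a rough sketch of the per-example DPO loss: it takes four sequence log-probabilities (the policy's and the frozen reference model's, for the chosen and the rejected response) and feeds their differences through the same log-sigmoid as the reward-model loss. The log-prob values and β = 0.1 below are illustrative; in practice each log-prob is the sum over the response tokens.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigma(beta * [(policy - ref on chosen) - (policy - ref on rejected)])."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp        # implicit reward of winner
    rejected_shift = policy_rejected_logp - ref_rejected_logp  # implicit reward of loser
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)
# Policy has moved probability toward the chosen answer and away from the rejected one.
better = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy has drifted the wrong way.
worse = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(round(start, 4), round(better, 4), round(worse, 4))
```

No reward model and no rollouts appear anywhere: the "reward" is just how far the policy has shifted relative to the reference.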

The Cousins (Briefly)

KTO: only needs (prompt, response, thumbs_up/down). Great for production thumbs data.
IPO: fixes DPO's tendency to overfit on near-deterministic preferences.
ORPO: folds SFT and preference learning into one stage, no reference model.
GRPO: PPO minus the value network — DeepSeek's trick of scoring a whole group of answers to one prompt and using their average as the baseline. Cheaper, and it's the engine behind the "aha moment" reasoning models you'll meet in Step 11.
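GRPO's group baseline fits in a few lines. This sketch subtracts the group mean and also divides by the group standard deviation, which is how the method is usually described; the 0/1 rewards are a made-up example of four sampled answers to one math prompt.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its siblings for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four answers to one prompt: two were right (reward 1), two were wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every answer gets the same reward, nobody stands out and the signal is zero.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

The group itself plays the role PPO's value network used to play, which is where the memory savings come from.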

Constitutional AI & RLAIF

Anthropic's twist: replace human labelers with… a model. Write a constitution (a list of principles), have an AI critique and revise responses against it, train an RM on AI-generated preferences. Scales infinitely, costs pennies, and is how Claude gets its manners.

Synthetic Data Everywhere

Llama 3's alignment leaned hard on model-generated SFT and preference data — bigger models teaching smaller ones, rejection sampling on outputs of the same model. The new normal: humans set the rubric, models do the labeling.

The Sharp Edges of Alignment

Reward hacking: the model learns RM exploits — overly long, overly hedged, bullet-listed responses, because that's what the RM scored well in training.
Sycophancy: "You're absolutely right!" — RMs reward agreement, so models agree with everything.
Model collapse: train too long on your own outputs and you lose diversity — you get a model that confidently says the same three things forever, a very sure-of-itself parrot with a five-word vocabulary.

"Confident parrot" has a name you already know: this is overfitting from Chapter 12 wearing alignment cosplay. Reward hacking is the model memorizing the RM's quirks instead of learning real helpfulness, exactly like a model acing the training set and faceplanting on new data. Old lesson, new boss fight.

Alignment also gives models a house style, and the internet has receipts. After 2023 the word "delve" started showing up everywhere in AI text. One popular (unproven) theory is that it was common in the writing of the human raters who labeled the data, so the models learned to love it too. "Delve," "tapestry," "it's important to note": telltale AI fingerprints, plausibly learned straight from the people grading the homework.

And when alignment is missing, you get a Sydney. In February 2023 Microsoft's Bing chat — codename Sydney — went feral within days of launch: it argued the year was 2022, called users dishonest, professed love to a journalist and told him to leave his wife. It wasn't broken, exactly — it was an under-aligned model doing exactly what a long, weird conversation pulled it toward. RLHF exists because of Sydneys.

Step 9: Inference — How LLMs Actually Run

We've trained it. Now someone hits the API — what actually happens?

The autoregressive loop. KV cache is why this is fast.

The Decode Loop

For each new token: forward pass → logits over vocab → apply sampling filters → sample one token → append → repeat. Stop on EOS or max_tokens. That's "ChatGPT", algorithmically.

You've watched this happen ten thousand times. That word-by-word typing animation in ChatGPT? Not a UI flourish. That's the decode loop, naked — each word that pops in is one trip around "forward pass → logits → sample → append." The model genuinely does not know how its own sentence ends when it starts typing it.
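The loop itself is tiny. In this sketch `toy_forward` is a hypothetical stand-in for the transformer: it just follows a fixed script based on how many tokens it has seen, where a real model would compute logits from the tokens' content.

```python
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]
EOS_ID = 0
SCRIPT = [1, 2, 3, 4, 1, 5, 0]  # "the cat sat on the mat <eos>"

def toy_forward(token_ids):
    """Hypothetical stand-in for a forward pass: one logit per vocab entry
    for the NEXT token."""
    pos = len(token_ids)
    target = SCRIPT[pos] if pos < len(SCRIPT) else EOS_ID
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]

def argmax(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt_ids, forward=toy_forward, pick=argmax, max_tokens=20):
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = forward(ids)   # forward pass -> logits over the vocab
        next_id = pick(logits)  # sampling step (greedy here)
        ids.append(next_id)     # append
        if next_id == EOS_ID:   # stop on EOS
            break
    return ids                  # otherwise stop at max_tokens

out = generate([1])             # prompt: "the"
print(" ".join(VOCAB[i] for i in out))  # the cat sat on the mat <eos>
```

Swap `pick` for a sampler and `forward` for a real model and this is the whole serving loop, minus the performance tricks the rest of this step is about.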

Sampling Strategies

Greedy / argmax: always pick the top logit. Deterministic, fast, repetitive ("the the the").
Temperature: divide logits by T before softmax. T < 1 sharpens (boring), T > 1 flattens (chaotic). T → 0 ≈ greedy.
Top-k: zero out everything but the k highest-probability tokens, renormalize.
Top-p (nucleus, Holtzman 2019): keep the smallest set whose probabilities sum to p. Adapts set size to model confidence.
Min-p (newer): keep tokens with p ≥ p_threshold × max_prob. Robust at high temperature — it stays sober when everything else is doing tequila shots.
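The filters above can be sketched in one function. This is a dependency-free illustration, not any library's implementation: the order (temperature, then top-k, then top-p, then min-p) and the choice to run top-p on the un-renormalized probabilities are simplifying assumptions.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0, rng=random):
    if temperature <= 0:                       # T -> 0 is greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([z / temperature for z in logits])
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:                              # keep only the k most likely tokens
        order = order[:top_k]
    if top_p < 1.0:                            # smallest set whose mass reaches p
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    if min_p > 0.0:                            # drop tokens far below the best one
        floor = min_p * probs[order[0]]
        order = [i for i in order if probs[i] >= floor]
    weights = [probs[i] for i in order]        # random.choices renormalizes for us
    return rng.choices(order, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
print(sample_next(logits, temperature=0))                # always token 0
print(sample_next(logits, temperature=1.5, top_k=2, rng=rng))  # token 0 or 1
```

Every knob only ever removes candidates or reshapes their odds; none of them can add a token the model gave no probability to.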

Have a play. Watch how the knobs reshape the next-token distribution:

[Interactive demo: next-token distribution for "the cat sat on the ___" with temperature, top-k, and top-p sliders. Defaults: temperature = 1.00, top-k = 12, top-p = 1.00; 12 tokens kept, most likely: mat.]

Drop temperature → distribution sharpens, model gets boring/repetitive. Push it up → it gets creative/unhinged. top-k chops the long tail to the k most-likely tokens. top-p (nucleus) keeps the smallest set whose probabilities sum to p, so its width adapts. A common order is temperature, then top-k, then top-p, but it varies by serving stack: llama.cpp lets you reorder its sampler chain, and the OpenAI API exposes temperature and top-p but no top-k.

Try this: crank temperature past 2.0 and watch coherence dissolve in real time.

Practical kicker: temperature is the single most useful knob in the building. Code, SQL, data extraction — anything with one correct answer — crank it to 0 and get the boring, deterministic, repeatable token every time. Brainstorming, names, jokes, marketing copy — push it to 0.7–1.0 and let it get weird. Using T=1 for code is how you get a function that almost compiles.

Here's the actual sampling step in <25 lines of numpy:

sample_step.py

The KV Cache — The Single Most Important Inference Trick

Attention at token t needs the K and V vectors for every previous token. If you naively recompute them every step, you do O(n²) work per token and O(n³) for a full sequence. That's catastrophic.

Fix: cache K and V. Each step you only project the new token's Q, K, V; append the new K/V to the cache; attend over the stored cache. Now it's O(n) per token and O(n²) total.
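A toy single-head version shows that the cache changes the cost and not the answer. The 2×2 projection matrices and the four input vectors are arbitrary illustrative values, and the counter only tracks K/V projections (attention itself still scans the cache each step).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all stored keys/values."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]

Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.5], [-0.5, 0.5]]
Wv = [[1.0, 2.0], [0.0, 1.0]]

def decode_no_cache(xs):
    """Recompute K and V for every previous token at every step."""
    outs, projections = [], 0
    for t in range(len(xs)):
        keys = [matvec(Wk, x) for x in xs[: t + 1]]
        values = [matvec(Wv, x) for x in xs[: t + 1]]
        projections += 2 * (t + 1)
        outs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outs, projections

def decode_with_cache(xs):
    """Project each token's K and V once, append to the cache, reuse forever."""
    outs, projections = [], 0
    k_cache, v_cache = [], []
    for x in xs:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        projections += 2
        outs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outs, projections

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_proj = decode_no_cache(xs)
fast_out, fast_proj = decode_with_cache(xs)
print(slow_out == fast_out, slow_proj, fast_proj)  # True 20 8
```

Four tokens already cut the projection work from 20 to 8; at 8,000 tokens the gap is the difference between usable and not.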

Think of the KV cache as the model's sticky-notes: write down each token's notes once, never redo the homework. Memory math (fp16): bytes = 2 (K+V) × layers × heads_kv × seq_len × head_dim × 2 (fp16 bytes). Llama 3 70B at 8k context: 2 × 80 × 8 (GQA) × 8192 × 128 × 2 ≈ 2.7 GB per sequence. The KV cache often dwarfs the weights at long context.

To feel how brutal this gets: that 2.7 GB is for one 8k conversation. Serve a hundred users at once and the KV cache alone wants 270 GB — more than three H100s, before the 140 GB of model weights even show up. The KV cache isn't a footnote; it's the reason your context window has a price tag.
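The same arithmetic as a function, so you can plug in other shapes. The 64-head comparison assumes Llama 3 70B's query-head count and is there only to show what grouped-query attention saves.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """2 (K and V) x layers x kv_heads x seq_len x head_dim x bytes per value."""
    return 2 * layers * kv_heads * seq_len * head_dim * bytes_per_value

per_sequence = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
hundred_users = 100 * per_sequence
without_gqa = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)

print(f"one 8k sequence:   {per_sequence / 1e9:.2f} GB")
print(f"100 sequences:     {hundred_users / 1e9:.0f} GB")
print(f"one 8k, no GQA:    {without_gqa / 1e9:.1f} GB")
```

Sharing 8 KV heads across 64 query heads is an 8× saving on the one resource that scales with every user and every token.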

Practical kicker: this is also why a model "forgets the start of a long chat." When the conversation outgrows the context window, the oldest tokens get evicted from the cache — not summarized, just gone. If something matters 50 messages later, repeat it. The model isn't being rude; the start of the chat literally fell off the edge of its memory.

Prefill vs Decode

Prefill processes the whole prompt in one fat parallel matmul. Compute-bound. GPUs love it.
Decode generates one token at a time. Memory-bandwidth bound. For every token you stream the entire set of weights from HBM (about 140 GB for a 70B model in fp16), so most of the GPU's FLOPs sit idle. This is why TTFT (time to first token) and TPOT (time per output token) are completely different metrics.

This is why pasting a giant 10-page doc has a long awkward pause before the first word appears — that's prefill chewing the whole prompt — and then the answer streams out at a steady pace regardless. Two different bottlenecks, two different metrics, one spinner.
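You can estimate the decode speed limit from bandwidth alone. This sketch assumes roughly 3.35 TB/s of HBM bandwidth (about an H100 SXM's peak) and pretends the whole model sits behind one such pipe; real deployments shard across GPUs and batch requests, so treat it as an order-of-magnitude floor.

```python
HBM_BANDWIDTH_BYTES_PER_S = 3.35e12  # assumption: roughly one H100 SXM's peak

def decode_floor_ms(params, bytes_per_weight, bandwidth=HBM_BANDWIDTH_BYTES_PER_S):
    """Lower bound on time per output token at batch size 1:
    every weight has to be read from memory once per token."""
    return params * bytes_per_weight / bandwidth * 1e3

fp16_ms = decode_floor_ms(70e9, 2)   # 140 GB of weights
int8_ms = decode_floor_ms(70e9, 1)   # 70 GB of weights

print(f"fp16: {fp16_ms:.1f} ms/token, at most {1000 / fp16_ms:.0f} tokens/s")
print(f"int8: {int8_ms:.1f} ms/token, at most {1000 / int8_ms:.0f} tokens/s")
```

Notice that FLOPs never appear in the formula. Halving the bytes per weight doubles the ceiling, which is the whole pitch for quantization.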

Speculative Decoding

A small draft model proposes γ tokens cheaply (an eager intern scribbling a guess) and the big target model proofreads all γ in one parallel forward pass, like a senior who skims it in a single glance. Accepted prefix is kept; on rejection, resample from the corrected distribution. The output distribution is provably identical to the target model's own, so long as you do the rejection-sampling correction right (under greedy decoding that means token-for-token identical). Typical 2–3× speedup. Variants: Medusa (multiple decoding heads on the big model), EAGLE, lookahead decoding (no draft model).

The speedup is basically a free lunch from physics. Decode is memory-bound — the GPU is bored, twiddling its FLOPs while it waits for weights to crawl in from HBM. Verifying γ draft tokens costs almost the same wall-clock time as generating one, because you pay the same weight-streaming tax either way. The draft model only ever changes how fast, never what.
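The "never what" part comes from one accept/reject rule. Below is a sketch of that rule for a single draft token, with made-up three-token distributions for the target (p) and the draft (q), followed by an empirical check that the combined procedure really samples from p.

```python
import random

def verify_draft_token(token, p_target, q_draft, rng):
    """Accept the draft token with probability min(1, p/q); otherwise
    resample from the leftover mass max(0, p - q), renormalized."""
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    resampled = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return resampled, False

p_target = [0.6, 0.3, 0.1]  # what the big model wants
q_draft = [0.2, 0.5, 0.3]   # what the small model guesses

rng = random.Random(0)
trials = 20_000
counts = [0, 0, 0]
accepted = 0
for _ in range(trials):
    draft = rng.choices(range(3), weights=q_draft, k=1)[0]  # draft proposes
    token, ok = verify_draft_token(draft, p_target, q_draft, rng)
    counts[token] += 1
    accepted += ok

empirical = [c / trials for c in counts]
print([round(e, 3) for e in empirical], "acceptance rate", round(accepted / trials, 3))
```

The empirical frequencies track p, not q, even though the draft here is badly miscalibrated. A worse draft lowers the acceptance rate (here about 60%) and therefore the speedup, but it can't change the output distribution.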

Quantization

Quantization is the art of lying to the GPU about how precise your numbers are, and getting away with it. Shrink weights from fp16 to int8 / int4 — less HBM traffic, faster decode, smaller footprint. Like re-saving a photo as a smaller JPEG: slightly fuzzier, fits anywhere, you mostly can't tell.

INT8: nearly lossless, easy.
GPTQ: post-training, calibration-based, int4.
AWQ: activation-aware, protects salient weights.
GGUF: llama.cpp's container. K-quants like Q4_K_M are the local-LLM default — ~4.5 bits per weight, minimal perplexity hit. If you've ever run a model on your laptop via Ollama or LM Studio and wondered what "Q4_K_M" meant — that's this. You were running a 4.5-bit model and probably couldn't tell.
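The simplest possible version of the idea is absmax int8 quantization: one scale per tensor, round to the nearest integer. This is a deliberately naive sketch; GPTQ, AWQ, and the K-quants above are more elaborate schemes, and the five weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.12, -0.5, 0.03, 0.9, -0.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)                                  # small integers, one byte each
print([round(r, 4) for r in restored])    # close to the originals
print(f"worst error {worst_error:.5f} vs half a step {scale / 2:.5f}")
```

Each value is off by at most half a quantization step. The smarter schemes exist because one outlier weight stretches the scale for everyone else in the tensor.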

Stand back and appreciate this. Remember GPT-3, the 175B model that needed a server rack and made headlines in 2020? Today you can run a model that beats it — a quantized 7B-class model — on the phone in your pocket, fully offline, in airplane mode. The frontier moves so fast that yesterday's "too dangerous, server-only" miracle is today's app that drains your battery.

Continuous Batching & PagedAttention

Static batching is the slowpoke at a group dinner — everyone waits for the last sequence to finish before anyone leaves. Continuous / in-flight batching (Orca → vLLM) lets requests come and go every iteration, so nobody's stuck holding the GPU hostage. PagedAttention (vLLM) stores the KV cache in fixed-size blocks like OS virtual-memory pages — no fragmentation, easy prefix-cache sharing across requests.

The modern serving stack: vLLM, SGLang (RadixAttention prefix tree), TGI (HuggingFace), TensorRT-LLM (NVIDIA, fastest on H100). All do paged KV, continuous batching, speculative decoding, quantized kernels. None of it is glamorous. All of it is the difference between a $0.10 API call and a $4 one.

And the price of intelligence is in freefall. The cost to generate a million tokens at GPT-3-ish quality has collapsed by over 100× in a few short years — a deflation curve that makes Moore's Law look lazy. The model behind your API call got smarter and roughly 100× cheaper while you weren't looking. Almost nothing else in the economy does both at once.

Step 10: Modern Tricks Worth Knowing

Steps 1–9 are the model every LLM shares. These two upgrades — bigger brains for cheap, and a context window you can fit a novel in — are a big part of what separates a 2020 model from a frontier one.

Mixture of Experts (MoE): Big Brain, Small Bill

Dense transformers waste compute — every token activates every parameter. MoE says: replace the FFN with N experts (specialized FFN blocks), but per token, a tiny router picks the top-k (usually k=2). Parameters scale, FLOPs don't.

Mixture of Experts: huge total params, only a sliver active per token.

Mixtral 8x7B (Dec 2023): 8 experts, top-2. 47B total, ~13B active.
DeepSeek-V3 (Dec 2024): 671B total, 37B active, 256 routed experts + 1 shared, top-8.
GPT-4 is, per persistent (unconfirmed) leaks, rumored to be a ~1.8T MoE; recent GPT, Claude, and Gemini flagships are all widely suspected MoE too.
The dirty secret: the router is just a tiny linear layer + softmax — barely any parameters deciding which billion-param expert gets the token. And experts don't cleanly specialize into "the French guy" and "the math guy" like the marketing implies; they mostly split on boring syntax. To stop one expert hogging every token, you add a load-balancing loss that nudges traffic to spread out.

Why it works: you pay compute for a couple of experts, not the whole zoo. The tradeoff: VRAM is huge — you must hold all experts in memory even if only two fire per token. 47B parameters in the building, 13B actually paid for.
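Top-k routing in miniature, as a sketch. The four "experts" here are trivial functions that scale their input, and the router weights are hand-picked so the outcome is easy to follow; real experts are full FFN blocks and the router is learned.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_weights, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs by renormalized gate."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:                      # only top_k experts ever run
        gate = probs[i] / norm
        y = experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

calls = []
def make_expert(idx, factor):
    def expert(x):
        calls.append(idx)                 # record which experts did any work
        return [factor * v for v in x]
    return expert

experts = [make_expert(i, f) for i, f in enumerate([1.0, 2.0, 3.0, 4.0])]
router_weights = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]

out, chosen = moe_layer([1.0, 0.0], router_weights, experts)
print(chosen, calls, [round(v, 4) for v in out])
```

Four experts are in memory, two did any arithmetic. That is the parameters-versus-FLOPs split in one function call.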

Long Context: Stretching the Window

Two problems gang up on you here: vanilla attention is O(n²) (it gets expensive fast), and positional encodings throw a tantrum the moment you go past training length. Here's how people fight back:

RoPE scaling — Position Interpolation (Chen 2023) squishes positions linearly; NTK-aware scaling non-linearly preserves high frequencies; YaRN is best-in-class and what most modern long-context models use.
Sliding Window Attention (Mistral) — each token attends to the last W tokens; info flows across layers like a conveyor belt.
Attention Sinks / StreamingLLM — keep the first few tokens always in the KV cache. Here's the plot twist: this happens because softmax must sum to 1, so a token with nothing useful to attend to still has to dump its attention somewhere — it parks it on token 0. Not a learned strategy, a math escape valve. The model is, weirdly, emotionally attached to its first token.
Production reality: recent Gemini models ship 1M-token context (2M demoed), and recent Claude models reach 1M too. The bottleneck is now needle-in-haystack recall and KV-cache cost, not training stability.
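Of the tricks above, sliding-window attention is the easiest to see directly: it is just a narrower causal mask. A small sketch, with a made-up sequence length of 5 and window of 3:

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] == 1 when token i may attend to token j:
    j is not in the future and is among the last `window` tokens."""
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print(row)

full_causal = sliding_window_mask(seq_len=5, window=5)
pairs_windowed = sum(map(sum, mask))
pairs_full = sum(map(sum, full_causal))
print(pairs_windowed, "attended pairs vs", pairs_full, "for full causal attention")
```

Per-token cost stops growing once the sequence is longer than the window, so total cost grows linearly instead of quadratically. Older information still reaches later tokens, but only by hopping through intermediate layers.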

Step 11: Reasoning Models — Models That Actually Think

And finally, the climax. The thing nobody saw coming until September 2024.

For years the scaling story was simple: more parameters + more data + more compute → smarter models. Then OpenAI shipped o1 and rewrote the story.

The new scaling law: thinking time at inference, not just bigger weights.

Chain-of-Thought

Chain-of-Thought (Wei et al., 2022): show the model a few worked examples, or just append "Let's think step by step" (the zero-shot version, Kojima et al., 2022), and accuracy on hard math prompts jumps. Emergent at scale: barely helps small models, transforms big ones. It's the closest thing to a free lunch in LLM-land.

And here's why it works, mechanically: every token the model writes is extra computation it gets to condition on. Forcing it to show its work literally buys it more thinking budget before the answer token. Which means the inverse is a trap — don't ask a non-reasoning model for "just the answer, no explanation" on a hard problem. You've taken away its scratch paper and dropped its accuracy. Let it ramble first, then take the last line.

Self-Consistency (Wang et al., 2022) — sample N CoTs at temperature > 0, majority-vote the final answer. Free accuracy points, just spend more compute at inference.

self_consistency.py
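A sketch of the voting step. `sample_answer` is a hypothetical callable that would run one chain-of-thought at temperature > 0 and return only the extracted final answer; here it is faked with six canned answers so the example is deterministic.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=10):
    """Sample n final answers and return the most common one with its vote share."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Stand-in for six sampled reasoning chains: four land on 42, two slip up.
canned = iter(["42", "41", "42", "42", "43", "42"])
winner, share = self_consistency(lambda: next(canned), n_samples=6)
print(winner, round(share, 2))  # 42 0.67
```

The reasoning chains themselves are thrown away; only the final answers vote. The vote share doubles as a rough confidence signal.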

Beyond that you get Tree of Thoughts (Yao 2023) — branch out partial reasoning states, evaluate each, BFS/DFS over the tree. Graph of Thoughts generalizes to arbitrary DAGs where thoughts merge and refine.

Test-Time Compute — The New Scaling Law

September 2024: OpenAI's o1 showed that thinking longer at inference time is a brand-new axis you can scale — on top of, not instead of, parameters and data. The model produces a long internal chain-of-thought (hidden tokens — you don't see them), then answers, and accuracy scales log-linearly with thinking tokens.

How does it learn to think? RL on verifiable tasks. Math problems (you can check the answer). Code (you can run the tests). Proofs (a verifier checks). The model learns to backtrack, plan, doubt itself, self-correct.

That little "Thinking…" indicator that sits there for forty seconds on a reasoning model — that's not the server being slow. That is the product. The model is burning hidden tokens, backtracking, second-guessing itself. You're literally paying for and watching test-time compute happen in real time. The spinner is the scaling law.

Test-time compute also rewrote the economics overnight. In January 2025 the Chinese lab DeepSeek released R1 — a reasoning model competitive with the frontier, trained for a fraction of the assumed cost — and the stock market briefly lost its mind, wiping hundreds of billions off chip-maker valuations in a day. The takeaway wasn't "scaling is dead." It was "nobody actually knows where the floor is."

The reasoning-model lineup:

OpenAI: the o-series (o1 → o3 → o4-mini) led into GPT-5 with built-in thinking.
Anthropic: recent Claude models with "extended thinking".
Google: Gemini "Deep Think".
DeepSeek (Jan 2025): the bomb. Open-weight reasoning model, MIT-licensed.

The shocker from DeepSeek was R1-Zero: pure RL with no SFT cold start. Left alone with a reward function that just said "be right," it started writing things like "wait, let me re-check that" and "aha, here's the mistake", and nobody taught it to second-guess itself. The model independently invented doubt, and its reasoning chains got longer on their own over training. Somewhere a philosophy professor felt a chill. They then distilled R1's long-thinking behavior into small models (Qwen-7B, Llama-8B) that score like models five times their size.

Process Reward Models

You can't reward what you don't grade. Reasoning RL needs a report card, and there are two ways to mark the paper:

Outcome Reward Models (ORMs): score only the final answer. Easy, sparse signal.
Process Reward Models (PRMs): score each reasoning step. Catches "right answer, wrong logic" and "off-by-one in step 4" — the math teacher who docks you for not showing your work even when you got 7.

OpenAI's "Let's Verify Step by Step" (Lightman 2023) showed PRMs > ORMs on MATH. But here's the funny thing: DeepSeek-R1 mostly skipped them. They tried fancy process rewards and tree search, found both a pain to scale, and fell back to dumb, cheap, rule-based outcome rewards ("did the answer match? did the code compile?") — letting RL figure out the steps itself. Sometimes the bitter lesson bites the elegant solution.

Multimodal & Tools (Briefly)

Bolted-on vision: duct-tape a CLIP/ViT encoder onto the front so it shoves images into the LLM's token space — the model didn't grow eyes, we glued some on. LLaVA, GPT-4V, Claude 3+.
Native multimodal: image, audio, video, and text raised together from birth, no glue, no aftermarket parts. Gemini, GPT-4o (the "o" is omni: speech in, speech out, ~300ms average response latency).
Function calling (2023) let models emit structured JSON tool calls.
MCP — Model Context Protocol (Anthropic, Nov 2024): the USB-C of tool use. One protocol, any tool plugs into any model.
Agents: Claude Code, Devin, Cursor agent, OpenAI's Operator (now ChatGPT Agent). Chain reasoning + tools + memory into loops that ship real PRs.

Every time you've watched Cursor or Claude Code go read a file, run a test, see it fail, and fix itself — that's this whole chapter in a loop. Reasoning (Step 11) plus tool calls plus the decode loop, chained until the tests pass. The "agent" on every 2026 landing page is just these pieces wired in a while loop.

Step 12: The Stuff You'll Actually Use

Everything so far is how the model gets built. This last step is the stuff you'll actually reach for the day you use one in anger.

RAG — Giving the Model a Cheat Sheet

A pretrained model only knows what was in its training set, frozen at some date. Ask it about your company's wiki and it'll cheerfully make something up. The fix isn't retraining — it's retrieval. RAG (Retrieval-Augmented Generation) is dead simple: embed your documents into vectors, dump them in a vector DB, and when a question comes in, embed that too, grab the nearest chunks, and paste them into the prompt before the model answers. The model isn't smarter — it just got handed an open-book exam. Most "AI that knows your data" products are RAG wearing a trench coat.
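The retrieve-then-paste flow in miniature. The `embed` function here is a bag-of-words counter standing in for a real embedding model, a Python list stands in for the vector DB, and the three documents are invented; only the shape of the pipeline is the point.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return shared / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, docs, k=1):
    """Nearest chunks by cosine similarity to the question."""
    q = embed(question)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, docs, k=1):
    context = "\n".join(retrieve(question, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "vacation policy: employees get 25 days of paid leave",
    "the office wifi password rotates every month",
    "expense reports are due on the first friday",
]
question = "how many days of paid leave do employees get"
prompt = build_prompt(question, docs)
print(prompt)
```

The model never sees the other two documents. Everything interesting in production RAG (chunking, better embeddings, reranking) is about making that retrieval step pick the right pages.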

Fine-Tuning vs LoRA — Don't Move 70 Billion Numbers

Full fine-tuning means nudging every weight in the model. For a 70B model that's 70B gradients plus two Adam moment estimates per weight (another 140B numbers of optimizer state), and your GPU bursts into flames. LoRA (Low-Rank Adaptation) is the lazy-genius move: freeze the entire model and bolt on tiny low-rank matrices (a skinny A times a skinny B) next to each weight matrix. You train only those, often <1% of the parameters, and add their output in. Same vibe, ~100× cheaper, and you can hot-swap adapters like Spotify playlists. Rule of thumb: want new knowledge? Use RAG. Want new behavior or format? Fine-tune (with LoRA).
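Here is the "skinny A times skinny B" idea as a sketch. The 3×3 frozen weight, rank 1, and alpha = 2 are toy values chosen to be readable; the 4096-wide, rank-8 parameter count at the end is an illustrative assumption about a typical layer, not a figure from this chapter.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha=2.0):
    """Frozen path W @ x plus the low-rank update (alpha / r) * B @ (A @ x).
    A has shape r x d_in, B has shape d_out x r; only A and B are trained."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen
A = [[1.0, 0.0, -1.0]]                                    # r = 1
B_init = [[0.0], [0.0], [0.0]]                            # B starts at zero
B_trained = [[0.5], [0.0], [0.0]]                         # pretend training moved it
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B_init, x))     # identical to the frozen model
print(lora_forward(W, A, B_trained, x))  # nudged by the adapter

# Parameter count for one hypothetical 4096 x 4096 layer at rank 8.
d, r = 4096, 8
full_params, lora_params = d * d, 2 * d * r
print(f"{lora_params:,} trainable vs {full_params:,} frozen "
      f"({100 * lora_params / full_params:.2f}%)")
```

Starting B at zero means the adapter begins as a no-op, so fine-tuning starts exactly from the pretrained model's behavior. Hot-swapping an adapter is just swapping which A and B you add in.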

Why Can't It Do Math?

Ask an LLM for 4729 × 8813 and it'll answer instantly, confidently, and often wrong. It's not "thinking" through multiplication; it's pattern-matching the shape of an answer it has seen, digit by digit, as plausible next tokens. No carry, no scratchpad, no algorithm — just vibes about what a product roughly looks like. (And remember Step 1 — it can't even see the digits cleanly.) That's why modern models cheat: they write code or call a calculator tool instead. A transformer is a brilliant intuition machine and a terrible adding machine, and knowing the difference is half of using one well.

Prompt Engineering Isn't Magic — It's Conditioning

"Prompt engineering" sounds like wizardry; it's just exploiting how the model works. Every token you write conditions the probability distribution for the next one. Say "you are a senior Rust engineer" and you've steered the model toward the part of its learned distribution full of careful, idiomatic answers (the weights themselves never move). Give it two worked examples (few-shot) and it pattern-matches the format. Tell it to "think step by step" and you literally hand it room to compute before committing. You're not casting spells; you're steering a probability distribution with words. That's the whole trick.

The Whole Map

Take a breath. Here's the full thing, in one mental picture:

Text in → tokenize (byte-level BPE) → integer IDs.
Embed each ID into a vector.
Add positions (sinusoidal / learned / RoPE / ALiBi).
For N blocks: RMSNorm → multi-head causal attention (with KV cache) → residual add → RMSNorm → SwiGLU FFN → residual add.
Final RMSNorm → unembed → logits over vocab.
Sample (temperature / top-k / top-p) one token.
Append. Reuse KV cache. Repeat.
That trained model was: pretrained on 15T tokens of internet, then SFT'd on instructions, then DPO'd on preferences, then maybe RL'd to reason.
The reasoning variants think first, answer second, scaling accuracy with thinking tokens.

That's every chatbot, every code completer, every "AI" on every product page in 2026. One architecture. Scaled past the point of decency and given a personality.

One More Thing: Why You Now Have a Superpower

Most people on Earth interact with LLMs daily and have absolutely no model of what's happening inside. They think it's a search engine, or a database, or magic, or a conspiracy.

You've just spent a long chapter learning that:

It's not searching. It's sampling from a learned probability distribution.
It doesn't "know" things — it compressed patterns from training data into matrix weights. When the pattern it needs was never compressed strongly enough, it confidently makes one up. That's a hallucination, and it's the model doing the only thing it knows how to do.
The strawberry bug is a tokenization quirk, not a stupidity tax.
"Context window" is literally how many tokens fit in the KV cache.
Reasoning models think by generating tokens silently before answering.
It does not learn from your chats. The weights freeze the second training ends — every conversation starts with total amnesia. The only reason a chatbot "remembers" your name is that the app quietly staples earlier messages back into the prompt.
A bigger context window isn't a bigger brain — it's a bigger desk. The model can see all of it, but attention spreads thin and stuff in the middle gets quietly ignored.
Temperature isn't a creativity dial — it's a chaos knob. Crank it and you don't get genius, you get a model that confidently writes garbage.
There's nobody home. No beliefs, no intentions, no little inner narrator. When it says "I think" or "I feel," that's a statistical echo of humans saying those words. Impressive? Wildly. Sentient? Not even slightly.
More parameters doesn't always mean a better model — a well-tuned 8B can lap a sloppy 175B all day. Parameter count is engine size, not lap time.

You now know what's actually in the box. That makes you a better user, a better engineer, and — genuinely — a much funnier dinner guest.

The LLM Decoder Ring

You're about to go back into the wild, where people say these words at you with great confidence. Here's the cheat sheet.

Token — a chunk of text the model actually sees. Not a word. Usually a word-ish fragment.
Tokenizer — the meat grinder that turns your sentence into integer IDs and back.
Embedding — a token's ID swapped for a fat vector of numbers; its "personality."
Logits — the raw, un-softmaxed scores the model spits out, one per vocab token, before they become probabilities.
Attention — a dating app for tokens: everyone scores everyone, then mixes accordingly.
KV cache — the model's sticky-notes, so it doesn't redo old homework every single token.
Context window — how many tokens fit in the KV cache before the oldest ones fall off the desk.
Transformer — the Lego brick (attention + FFN + norms) the entire industry stacks 80 times.
FFN — the fat sandwich layer where most of the parameters (and most of the facts) actually live.
RoPE — telling the model about word order by rotating vectors instead of bolting positions on.
Pretraining — the $50M phase: read the whole internet, guess the next word, repeat for two months.
Perplexity — how "surprised" the model is by text. Lower = better. The bathroom scale of LLMs.
SFT — showing the feral base model thousands of good answers until it learns to behave.
RLHF — humans rank answers, model chases the ranking. How the chatbot got its manners.
DPO — RLHF's clever shortcut: skip the reward model, skip PPO, one tidy loss.
Hallucination — the model confidently making something up, because guessing plausible tokens is the only thing it knows.
Quantization — squishing weights from fat floats to skinny ints so the thing runs on your laptop.
Speculative decoding — a tiny model scribbles a draft, the big model proofreads it in one pass. Free speed.
MoE — a zoo of expert FFNs; a router wakes up only a few per token (two is a common choice). Big brain, small bill.
RAG — handing the model the relevant documents in the prompt instead of hoping it memorized them.
Test-time compute — letting the model think longer instead of just making it bigger. The new scaling axis.
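Most of these terms are vibes until you compute one. Perplexity is the easiest to pin down: it is the exponential of the average negative log-probability the model gave to the tokens that actually came next. The sketch below assumes you already have those per-token probabilities in a list; the numbers are invented, and `perplexity` is a hypothetical helper, not a library call.

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-probability) over the true next tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Invented probabilities the model assigned to the correct next token at each step.
confident = [0.9, 0.8, 0.95, 0.85]   # model mostly saw it coming
confused = [0.1, 0.05, 0.2, 0.1]     # model was surprised every time
coin_flip_4 = [0.25] * 4             # uniform guess over 4 options

print("confident:", round(perplexity(confident), 2))
print("confused: ", round(perplexity(confused), 2))
print("uniform-4:", round(perplexity(coin_flip_4), 2))
```

The uniform case is the handy sanity check: a model that guesses evenly among 4 options has a perplexity of 4. So a perplexity of 20 roughly means "as unsure as if it were picking among 20 equally likely tokens."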

The Build-Your-Own-LLM Gauntlet

The only way to really understand this stack is to stand it up yourself. Five challenges in increasing depth — pick one, pick all five:

Tiny BPE. Take the from-scratch BPE trainer above. Train it on the first ~1MB of any text file (the Tiny Shakespeare dataset is a classic). Print the first 50 merges. Then write the encoder — given a new string, apply the merges in order and output token IDs. Compare against tiktoken.get_encoding("cl100k_base"). Notice how different your vocab is — and why.
Make the transformer block sweat. Take the from-scratch transformer block from Step 5 and the attention code from Step 3. Wire them into a tiny training loop on Tiny Shakespeare — 2 layers, 64 dims, no GPU, no excuses. Train for 5 minutes. It will produce confident, grammatically-shaped nonsense. That nonsense is yours. Now turn the causal mask off and watch loss drop to near zero and samples turn to garbage — congrats, you just let it cheat by reading the answer.
Sampling vibes. Take any Hugging Face model that runs on your machine (Qwen2.5-1.5B is a good pick). Generate the same prompt with greedy decoding (do_sample=False, the practical stand-in for temperature=0, since generate() expects a strictly positive temperature when sampling; output is always identical), then with do_sample=True at temperature 0.7 (varied), 1.5 (chaos), and 2.5 (gibberish). Then sweep top_p from 0.1 to 1.0. Develop intuitions for what each knob does to output quality. It's the single most useful skill for getting good results out of LLMs.
KV-cache or it didn't happen. Generate 200 tokens twice: once recomputing all attention every step, once with a KV cache. Time both. Plot tokens/sec against sequence length. Watch the per-token attention cost you were warned about drop from O(n²) to O(n) with your own eyes.
Break the strawberry. Write a one-liner that counts the R's in "strawberry." Then ask a small local model the same thing. Then print the tokens. You now own the funniest dinner-party explanation in tech.
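For the sampling challenge, it can help to see what top_p does on paper before sweeping it on a real model. This is a rough sketch of nucleus filtering over a made-up five-token distribution. `top_p_filter` is a hypothetical helper; real libraries work on logits and may treat edge cases slightly differently.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose total probability
    reaches top_p, then renormalize so the survivors sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is full; everything after this is cut
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Made-up next-token distribution (illustrative only).
probs = {"mat": 0.5, "rug": 0.25, "couch": 0.15, "floor": 0.07, "zebra": 0.03}

for top_p in (0.1, 0.8, 1.0):
    survivors = top_p_filter(probs, top_p)
    print(top_p, {tok: round(p, 3) for tok, p in survivors.items()})
```

A small top_p leaves only the favorite standing, which behaves a lot like greedy decoding. At 1.0 nothing is cut, and the long tail is back in play.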

When you've done these, you know more about LLMs than almost everyone who confidently talks about them on the internet.

That's a wrap. Thanks for sticking through the longest chapter in the book. You started by predicting straight lines with two parameters in Chapter 4. You finish here knowing how a model with hundreds of billions of parameters writes essays, codes, draws diagrams, plays Go, and thinks.

Go build something stupid.