Part 3 · Chapter 12

When Your Model Screws Up

Overfitting, underfitting, cross-validation, precision vs recall.

So, you've built a model. You fed it data, you watched the loss curve go down, and you got a prediction. You might have even calculated its accuracy and seen a glorious 99% printed to your console.

Time to deploy to production and become a billionaire, right?

Wrong.

Welcome to the most important chapter in this book. This is where we talk about debugging. Because your model might have 99% accuracy and still be complete and utter garbage. And when your model screws up, there's one person to blame: you.

Overfitting vs. Underfitting: The Student Analogy

The most common failure modes for a model are overfitting and underfitting. They represent two sides of the same coin: the model's ability to generalize from the training data to new, unseen data.

Underfitting (High Bias): The Lazy Student

The Symptom: The model performs poorly on the training data and the test data.
The Analogy: The student who didn't study at all. Fails the practice questions and fails the final exam.
The Cause: The model is too simple for the data, such as a linear model fit to a curved relationship.
The Fix: Use a more flexible model (polynomial features, a deeper tree, a neural net).

Overfitting (High Variance): The Memorizing Nerd

The Symptom: The model performs almost perfectly on the training data but falls apart on the test data.
The Analogy: The student who memorized the exact answers to the study-guide questions and is lost when the real exam asks something slightly different.
The Cause: Model is too complex. It learned the noise in the training data, not the signal.
The Fix: Simplify the model. Regularize, prune, get more training data.

[Interactive demo: increase the polynomial degree to see overfitting take over. Example state: degree 2, train MSE 0.066.]

At degree 1 you underfit. At degree 9 the curve thrashes through every point: train loss is tiny, but it will fail on anything new.

At high degree the model fits the training data almost perfectly, yet it predicts poorly on new data points because it has learned the noise, not the simple underlying curve. It has failed to generalize.
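To see the same effect outside the demo, here is a minimal sketch using only NumPy. The data is an assumption made for illustration (a sine curve plus Gaussian noise, ten training points), not the dataset behind the demo above, so the numbers will not match it.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_curve(x):
    # Hypothetical "simple underlying curve" for this sketch.
    return np.sin(2 * np.pi * x)

# Small noisy training set, larger noisy test set drawn from the same curve.
x_train = np.linspace(0, 1, 10)
y_train = true_curve(x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = rng.uniform(0, 1, 200)
y_test = true_curve(x_test) + rng.normal(0, 0.2, size=x_test.shape)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = {
        "train": mse(model, x_train, y_train),
        "test": mse(model, x_test, y_test),
    }
    print(f"degree {degree}: train MSE {results[degree]['train']:.4f}, "
          f"test MSE {results[degree]['test']:.4f}")
```

Training error can only go down as the degree rises, and a degree-9 polynomial passes through all ten training points, so its training error is essentially zero. The number to watch is the gap between the train and test columns.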

The Bias-Variance Tradeoff: The Goldilocks Problem

These two problems are fundamentally linked. This is the Bias-Variance Tradeoff, the central tension in supervised learning.

Bias is the error from a model being too simple (underfitting).
Variance is the error from a model being too sensitive to the training data (overfitting).
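For squared-error loss, these two definitions can be made precise. The sketch below assumes the data follows y = f(x) + ε with zero-mean noise of variance σ², and that the expectation is taken over random training sets and noise; the symbols f, f-hat, and σ are introduced here for illustration and do not appear elsewhere in this chapter.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The last term is noise that no model can remove, which is why the fight is over the first two.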

You can't have your cake and eat it too.

If you decrease bias (make your model more complex), you almost always increase variance.
If you decrease variance (simplify your model), you almost always increase bias.

The goal is not to eliminate one or the other, but to find the "Goldilocks" spot in the middle—a model that is complex enough to capture the true signal, but not so complex that it starts memorizing the noise.
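Regularization, listed earlier as a fix for overfitting, is one dial for moving along this tradeoff. The sketch below is a minimal ridge (L2-penalized) polynomial fit in plain NumPy. The synthetic data, the helper names `ridge_poly_fit` and `mse`, and the penalty values are all assumptions chosen for illustration; note that this simple version also penalizes the intercept.

```python
import numpy as np

def ridge_poly_fit(x, y, degree, lam):
    """Polynomial least squares with an L2 penalty lam * ||w||^2."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    # Minimizing ||Xw - y||^2 + lam * ||w||^2 is equivalent to ordinary
    # least squares on a design matrix with sqrt(lam) * I stacked below it.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(degree + 1)])
    y_aug = np.concatenate([y, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w."""
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 12)
y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(np.pi * x_test) + rng.normal(0, 0.2, size=x_test.shape)

fits = {}
for lam in (0.0, 1e-3, 1.0):
    w = ridge_poly_fit(x_train, y_train, degree=9, lam=lam)
    fits[lam] = {
        "norm": float(np.linalg.norm(w)),
        "train": mse(w, x_train, y_train),
        "test": mse(w, x_test, y_test),
    }
    print(f"lam={lam:g}: |w|={fits[lam]['norm']:.2f}, "
          f"train MSE {fits[lam]['train']:.4f}, test MSE {fits[lam]['test']:.4f}")
```

A larger `lam` shrinks the coefficients and raises training error: less variance, more bias. Which value gives the best test error depends on the data, which is exactly the Goldilocks search described above.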

Cross-Validation: Stop Lying to Yourself

How do you find this sweet spot? Your simple train-test split is a good start, but what if you just got lucky (or unlucky) with your split?

K-Fold Cross-Validation is the professional's tool for getting a more robust and honest evaluation of model performance.

Here's how it works (k=5 is a common choice):

1. Shuffle your dataset randomly.
2. Split it into k equal-sized folds (e.g., 5 folds of 20% each).
3. Now, you run 5 experiments:
- Run 1: Train on Folds 1–4, test on Fold 5.
- Run 2: Train on Folds 1, 2, 3, 5; test on Fold 4.
- Run 3: Train on Folds 1, 2, 4, 5; test on Fold 3.
- ...and so on.

You end up with 5 different performance scores. The average is your cross-validated performance.

This is like giving your student 5 different versions of the final exam and averaging their scores. Much more reliable.
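The steps above can be sketched by hand in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not a reference one: the helper names `k_fold_splits` and `cross_val_mse`, the synthetic data, and the polynomial model are all made up for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # step 1: shuffle
    folds = np.array_split(indices, k)     # step 2: split into k folds
    for i in range(k):                     # step 3: one run per held-out fold
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_val_mse(x, y, degree, k=5, seed=0):
    """Return the k per-fold test MSEs for a polynomial of the given degree."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(x), k, seed):
        model = Polynomial.fit(x[train_idx], y[train_idx], degree)
        errors = model(x[test_idx]) - y[test_idx]
        scores.append(float(np.mean(errors ** 2)))
    return np.array(scores)

# Hypothetical noisy data for the demonstration.
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=50)

for degree in (1, 3, 9):
    scores = cross_val_mse(x, y, degree)
    print(f"degree {degree}: mean CV MSE {scores.mean():.4f} "
          f"(std {scores.std():.4f})")
```

Looking at the spread across folds, not just the mean, tells you how much a single lucky or unlucky split could have misled you. In practice you would likely reach for scikit-learn's `KFold` and `cross_val_score` instead of rolling your own.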


Metrics That Actually Matter: Accuracy is for Amateurs

Okay, here's the biggest trap for beginners. You build a model, and you see: Accuracy: 99.0%. You think you're a genius.

But what if you're building a model to detect a rare disease that only affects 1% of the population? A lazy model that just predicts "No Disease" for every single person will be 99% accurate. And it will be 100% useless.

This is why accuracy can be a terrible metric, especially for imbalanced datasets. We need smarter metrics.
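The trap is easy to reproduce. The sketch below builds a made-up population of 1,000 people in which 1% have the disease (the size is an assumption for illustration) and scores the lazy model that always predicts "No Disease."

```python
import numpy as np

n_people = 1000
y_true = np.zeros(n_people, dtype=int)
y_true[:10] = 1                          # 1% of people actually have the disease

y_pred = np.zeros(n_people, dtype=int)   # lazy model: "No Disease" for everyone

accuracy = float(np.mean(y_pred == y_true))
cases_found = int(np.sum((y_pred == 1) & (y_true == 1)))

print(f"accuracy: {accuracy:.1%}")
print(f"sick patients found: {cases_found} of {int(y_true.sum())}")
```

The accuracy comes out at 99% while the model finds none of the ten sick patients.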

Analogy: The Fishing Net

Imagine your model is a fishing net. The fish are the positive cases you want to find (e.g., "spam," "disease"). The rocks and seaweed are the negative cases.

In the formulas below, TP, FP, and FN count true positives (fish in the net), false positives (rocks and seaweed in the net), and false negatives (fish left in the lake).

Precision: "Of all the stuff in your net, how much is actually fish?"
Precision = TP / (TP + FP)
High precision = the model is trustworthy when it says "yes."
When it matters: Spam filtering. A false positive (a real email sent to spam) is much worse than a false negative (a spam email reaching the inbox).

Recall (Sensitivity): "Of all the fish in the lake, how many did you catch?"
Recall = TP / (TP + FN)
High recall = the model finds most of the positive cases.
When it matters: Medical diagnosis. Missing a real disease can be catastrophic.

F1-Score: The harmonic mean of precision and recall.
F1 = 2 × P × R / (P + R)
A good default metric for many problems, especially imbalanced datasets.
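Here is a small sketch that computes these metrics straight from the formulas. The function name `classification_metrics`, the made-up `screener` predictions, and the convention of returning 0.0 when a denominator is zero are all assumptions for this example.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # fish in the net
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # rocks in the net
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # fish left in the lake
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    # Assumed convention: report 0.0 when a ratio is undefined (0 / 0).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 1,000 people, 10 of whom have the disease.
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

lazy = np.zeros(1000, dtype=int)      # always predicts "No Disease"

screener = np.zeros(1000, dtype=int)  # hypothetical model that flags 20 people
screener[:8] = 1                      # 8 of the 10 real cases
screener[10:22] = 1                   # plus 12 healthy people

lazy_metrics = classification_metrics(y_true, lazy)
screener_metrics = classification_metrics(y_true, screener)
print("lazy:    ", lazy_metrics)
print("screener:", screener_metrics)
```

The lazy model scores 99.0% accuracy with zero recall. The screener scores a slightly lower 98.6% accuracy, but its recall is 0.8, its precision is 0.4, and its F1 is about 0.53. Accuracy ranks the useless model first; the other three metrics do not.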

Metric	Question it Answers	When to Use It	Real-World Example
Accuracy	What fraction of predictions were correct?	Only for balanced datasets where FP and FN have similar costs.	Cats vs. dogs (balanced classes).
Precision	When my model predicts YES, how often is it correct?	When the cost of a False Positive is high.	Spam detection, investment.
Recall	Of all actual YES cases, how many did my model find?	When the cost of a False Negative is high.	Medical diagnosis, fraud.
F1-Score	How can I balance Precision and Recall?	When you need balance, or for imbalanced datasets.	Most real-world classification.

Stop using accuracy as your only metric. Start thinking about the cost of your model's mistakes. That's how you go from building toys to building tools.