Part 2 · Chapter 06

Decision Trees: The Judgmental Algorithm

A flowchart that learns. Gini, entropy, information gain.

So far, the models we've built have been a bit... abstract. They find the best-fit line or the best-separating plane by fiddling with weights and biases. They give us an answer, but the reasoning is hidden inside a mathematical formula.

Now, let's build a model you can argue with. A model that's as transparent as it is judgmental.

Meet the Decision Tree.

Imagine an algorithm that makes decisions like a paranoid, hyper-specific auntie asking 20 questions before letting you borrow her car. "Is the weather nice? Are you going on the highway? Did you check the tire pressure? Is your friend Chad going with you? I don't like Chad."

That's a Decision Tree. It's basically a flowchart on steroids, and it's one of the most intuitive and interpretable models in all of machine learning.

A Flowchart That Learns

Remember Chapter 1? We said a Decision Tree is just a flowchart that an algorithm learns from data. That's literally it.

A simple decision tree for classifying Iris flowers.

Each internal node is a question about a feature (e.g., "Is age < 30?").

Each branch is an answer to that question (e.g., "Yes" or "No").

Each leaf node is a final decision or prediction (e.g., "Class = 'Buys the Product'").

The genius of the algorithm is that it figures out the best questions to ask, and in what order, to make the most accurate predictions.

The Big Question: What's the Best Question?

How does the tree decide to ask "Is age < 30?" instead of "Is income > $50k?" It needs a way to measure how "good" a question is. A good question is one that splits a mixed group of data into purer, more organized subgroups.

We measure this purity (or lack thereof) with metrics like Gini Impurity or Entropy.

Analogy: The Fruit Basket of Chaos

Imagine you have a basket of 10 fruits.

Scenario 1: Perfect Purity. The basket has 10 apples. If you reach in, you're 100% certain you'll grab an apple. The chaos is zero.
- Gini Impurity = 0
- Entropy = 0
Scenario 2: Maximum Impurity. The basket has 5 apples and 5 oranges. It's a 50/50 shot. The chaos is at its maximum. You have no idea what you'll get.
- Gini Impurity = 0.5 (for a two-class problem)
- Entropy = 1.0

The Decision Tree algorithm works by trying out every possible split on every feature. For each split, it calculates the impurity of the resulting child nodes. It then chooses the split that results in the biggest decrease in impurity. This decrease is called Information Gain. The tree is greedy; it always picks the split that gives it the most information right now.

Live · move the threshold, watch impurity drop

gini left 0.00 · right 0.00 · gain 0.490

threshold2.80

Gini Impurity: The probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the subset. It's computationally faster than Entropy because it doesn't involve a logarithm. $G = 1 - i = 1 \sum C (p_{i})^{2}$
Entropy: A concept from information theory measuring the level of uncertainty or randomness. $H = - i = 1 \sum C p_{i} lo g_{2} (p_{i})$

In practice, they both do a very similar job. Gini is the default for many libraries (like scikit-learn's CART algorithm) because it's a bit quicker to compute.

From-Scratch Implementation: Building the Tree Recursively

Building a decision tree is a classic example of a recursive algorithm.

find_best_split(): This is the workhorse function. It needs to:
1. Loop through every column (feature) in the data.
2. Loop through every unique value in that column (as a potential split point).
3. For each potential split, divide the data into two groups (left and right).
4. Calculate the Gini Impurity of the split (a weighted average of the impurity of the two child nodes).
5. Keep track of the split that produced the lowest Gini Impurity so far.
Return the best feature and split value.
build_tree(): This is the recursive function.
- It takes a dataset (a node) as input.
- It calls find_best_split() on that data.
- Base Case (Stopping Condition): If a stopping condition is met, it becomes a leaf node and returns a prediction (e.g., the majority class of the data points at that node). Stopping conditions could be:
  - The node is perfectly pure (Gini = 0).
  - The tree has reached a max_depth we defined.
  - The number of samples in the node is below a min_samples_leaf threshold.
- Recursive Step: If no stopping condition is met, it creates a new internal node based on the best split. It then calls build_tree() on the left group and the right group, assigning the results to the left and right branches of the current node.

tree_pseudo.py

Overfitting: When the Tree Becomes a Forest Fire

What happens if you don't set any stopping conditions? The tree will keep splitting and splitting until every single leaf node is perfectly pure. It might even create a leaf node for a single, noisy data point.

This is overfitting, and decision trees are notoriously prone to it.

Analogy: The Over-Specific Student

Imagine a student studying for a history exam.

A good student learns the general patterns: "Revolutions are often caused by economic inequality and new political ideas."
An overfit student memorizes hyper-specific, useless facts: "On October 5th, 1789, a man named Dave in Paris complained about the price of bread while wearing a blue hat."

The overfit student gets 100% on the practice questions that cover Dave, but when the real exam asks a general question about the causes of the French Revolution, they have no idea what to say.

An overfit decision tree is the same. It learns the noise and quirks of the training data perfectly, resulting in a ridiculously complex tree that looks like a spider web. It performs great on the data it has seen, but it fails spectacularly on new, unseen data.

The solution is pruning. This can be:

Pre-pruning (Early Stopping): Don't let the tree grow too deep in the first place. This is what our max_depth and min_samples_leaf hyperparameters do.
Post-pruning: Grow the full tree first, then go back and remove branches that don't provide much information gain.

By limiting the tree's complexity, we force it to learn the general patterns, not the noise. This is a crucial step in building a tree that actually works in the real world. The tree's greatest strength—its ability to create specific rules—is also its greatest weakness. Its interpretability is a direct window into its fragility; a small change in the input data can lead to a completely different set of learned rules. This sensitivity is exactly why more robust methods, like Random Forests (which we'll hint at later), were invented—they build an entire committee of these unstable trees and let them vote, turning a collection of shaky experts into a wise crowd.