Part 2 · Chapter 08

Naive Bayes: Trust Issues but Make It Statistical

Build a spam filter from probabilities and bold assumptions.

Alright, let's talk about an algorithm that is simultaneously brilliant, powerful, and built on a foundation of complete and utter delusion.

Meet Naive Bayes.

This algorithm is the workhorse behind many early spam filters. It's a probabilistic classifier that's fast, efficient, and surprisingly effective. And it achieves all this by making one of the most ridiculously bold—and technically wrong—assumptions in all of machine learning. It's like a detective who assumes every suspect in a crime acted completely independently, with no communication or conspiracy whatsoever.

It's a "naive" assumption. But damn, if it doesn't work.

The Core Idea: Bayes' Theorem in Action

We're going back to Chapter 2 and dusting off our old friend, Bayes' Theorem. As a reminder, it's the formula that lets us update our beliefs in the face of new evidence.

P (A ∣ B) = \frac{P ( B ∣ A ) P ( A )}{P ( B )}

In classification, we want to calculate the probability of a certain class (C) given some observed data (D). For a spam filter, this translates to:

P (Spam ∣ Words) = \frac{P ( Words ∣ Spam ) P ( Spam )}{P ( Words )}

Let's break this down with a food allergy analogy:

P(Spam): The prior probability. What's the chance that any random email is spam, before we've even read it? Maybe it's 1 in 5 (20%). This is our starting belief.
P(Words | Spam): The likelihood. Given that an email is spam, what's the probability it contains certain words (like "Viagra," "prince," "free")? We can calculate this from our training data.
P(Words): The evidence. What's the overall probability of seeing these words in any email, spam or not?
P(Spam | Words): The posterior probability. After seeing the words in the email, what is our updated probability that it's spam? This is what we want to calculate.

We calculate the posterior probability for both 'Spam' and 'Ham' (not spam). Whichever is higher is our prediction.

The "Naive" Assumption: The Group Project Mentality

Here's where the magic and the madness come in. Calculating the probability of an entire sentence, like P("free money now"|Spam), is hard because the words aren't independent. The word "money" is more likely to appear if the word "free" is already there.

Naive Bayes says: "LOL, who cares."

It makes the class-conditional independence assumption. It naively assumes that every feature (in this case, every word) is completely independent of every other feature, given the class.

So, instead of a complex calculation, it does this:

P ("free money now" ∣ Spam) \approx P (free ∣ S) \cdot P (money ∣ S) \cdot P (now ∣ S)

This is obviously wrong. In the real world, words have context. "San" is not independent of "Francisco." But this assumption simplifies the math so dramatically that it becomes not just possible, but incredibly fast. It's the ultimate "good enough" engineering tradeoff: sacrifice theoretical purity for practical speed and effectiveness. It's the algorithm equivalent of assuming everyone in a group project will do their part without coordinating—a naive hope, but it simplifies planning.

From-Scratch Spam Filter: Let's Catch Some Scammers

Let's build a text classifier from the ground up.

Step 1: Data Preparation and Training

The "training" phase for Naive Bayes is just counting.

Take a dataset of emails already labeled spam or ham.
Calculate the prior probabilities: P(Spam) and P(Ham).
Create a vocabulary of all unique words in the dataset.
For each word in the vocabulary, calculate its likelihood for each class. For example: $P ("Viagra" ∣ Spam) = \frac{count of "Viagra" in spam}{total words in spam}$
Store all these probabilities in a dictionary or hash map. That's our "trained" model.

Step 2: The Problem of Zero Probability & Laplace Smoothing

What happens if our new email contains a word the model has never seen before? The count for this word would be 0, making its probability 0. When we multiply all the probabilities together, the entire calculation becomes 0, and our model breaks.

To fix this, we use Laplace Smoothing (or add-one smoothing). It's a simple trick: we pretend we've seen every word in our vocabulary one more time than we actually have.

P (word ∣ Class) = \frac{count(word, Class) + 1}{total words in Class + ∣ vocab ∣}

This ensures that no word ever has a zero probability. It's like giving every word a tiny head start.

Step 3: The predict() function

Now, to classify a new, unseen email:

Start with the prior probabilities for 'Spam' and 'Ham'.
For each word in the new email, multiply the running probability by the word's likelihood for that class.

Pro Tip: Since we're multiplying lots of small probabilities, we can run into floating-point underflow. The standard trick is to use the log of the probabilities instead. Multiplication becomes addition, which is numerically much more stable:

lo g P (Class) + i \sum lo g P (word_{i} ∣ Class)

Compare the final log-probability scores for 'Spam' and 'Ham'. The class with the higher score is our prediction.

naive_bayes.py

This approach reveals a powerful engineering principle: sometimes, a model that is technically "wrong" in its assumptions can be more useful in practice than a "correct" model that is too slow or complex to be feasible. Naive Bayes is a triumph of pragmatism over purity, and a testament to the power of a "lazy genius" shortcut.