Sigmoid, logistic regression, decision boundaries.
Great, you've taught a machine to draw a straight line. Impressive. You can now predict house prices, exam scores, and other things that live on a continuous number line.
But what about the questions that really matter?
- Is this email spam? (Yes/No)
- Is this credit card transaction fraudulent? (Yes/No)
- Is this a picture of a hot dog? (Hot Dog/Not Hot Dog)
These are classification problems. We don't want a number; we want a decision. A simple "yes" or "no." Let's teach our model how to make a choice.
Why Linear Regression Fails for Classification
Your first instinct might be to just use the linear regression model we just built. Let's say "No" is class 0 and "Yes" is class 1. Why can't we just fit a line to that?
Let's try it. Imagine we're predicting whether a tumor is malignant (1) or benign (0) based on its size. See the problem? Our beautiful line shoots off to infinity in both directions. It can predict a value of 1.8, or -0.4. What does a "malignancy probability" of 180% even mean? Or -40%? It's nonsense. It breaks our 0-or-1 world.
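To see the failure in numbers, here is a minimal sketch that fits a plain least-squares line to 0/1 labels. The tumor sizes and labels are made-up toy data, and `fit_line` is a hypothetical helper name, not something from the earlier chapter.

```python
# Hypothetical toy data: tumor sizes (cm) and labels (0 = benign, 1 = malignant).
sizes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]


def fit_line(xs, ys):
    """Closed-form ordinary least-squares fit of y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    m = cov / var
    b = mean_y - m * mean_x
    return m, b


m, b = fit_line(sizes, labels)

# Ask the line about a very small and a very large tumor.
small_tumor = m * 0.5 + b   # comes out below 0
large_tumor = m * 10.0 + b  # comes out above 1
print(small_tumor, large_tumor)
```

Neither output can be read as a probability, which is the whole problem.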
We need a new tool. We need something that takes the output of our linear equation and squishes it into a sensible range: between 0 and 1.
Enter the Sigmoid Function: The OG 'Squishinator'
Meet your new best friend: the Sigmoid function. It's a beautiful, elegant, S-shaped curve that is the absolute hero of binary classification.
The formula is:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Where $z$ is just the output of our old friend, the linear equation: $z = mx + b$.
Look at its properties:
- It takes any real number, from negative infinity to positive infinity.
- It squishes that number into a range between 0 and 1.
- A very large positive input for z gets mapped close to 1.
- A very large negative input for z gets mapped close to 0.
- An input of z = 0 gets mapped to exactly 0.5.
This is exactly what we need! We can now interpret the output of the sigmoid function as a probability.
- If our model outputs 0.98, it's 98% sure the answer is "Yes" (Class 1).
- If it outputs 0.05, it's 95% sure the answer is "No" (or 5% sure it's "Yes").
- If it outputs 0.5, it's completely uncertain. Flip a coin.
The sigmoid is like a translator that turns the raw, unbounded "score" from our linear model into an understandable probability between 0 and 1.
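A minimal sketch of the sigmoid in plain Python, checking the properties listed above. Branching on the sign of z is my own choice here: it keeps `math.exp` from overflowing on very negative inputs, and it is not necessarily how the original code was written.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        # exp(-z) is at most 1 here, so nothing can overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z).
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


print(sigmoid(10))    # close to 1
print(sigmoid(-10))   # close to 0
print(sigmoid(0))     # exactly 0.5
```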
DIY Logistic Regression: It's a Trap!
Now we're going to build a Logistic Regression model from scratch. And here's the secret that confuses everyone: despite its name, Logistic Regression is for CLASSIFICATION, not regression. The name is a historical accident designed to trip up beginners. Don't fall for it.
You're about to have a "wait a minute..." moment, because the code is going to look suspiciously familiar.
Step 1: The predict() function
This is almost the same as before, but we just wrap our linear equation in our new sigmoid function.
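A sketch of what that could look like, assuming a single input feature x and the parameters m and b from the linear model. The exact signature is an assumption on my part.

```python
import math


def sigmoid(z):
    """Squish any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict(x, m, b):
    """Return the model's probability that x belongs to class 1."""
    z = m * x + b       # the same linear equation as before
    return sigmoid(z)   # the only new step: squish z into (0, 1)


# Hypothetical parameters, purely for illustration.
print(predict(3.0, m=1.0, b=-3.0))  # z = 0, so this prints 0.5
```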
Step 2: The loss() function (Binary Cross-Entropy)
We can't use Mean Squared Error anymore. For classification, we need a loss function that punishes the model when it's confidently wrong. We use Binary Cross-Entropy (or Log Loss).
- If the true label is 1, the loss is $-\log(p)$, where $p$ is the predicted probability of class 1. Predict 0.99 → loss tiny. Predict 0.01 → loss huge.
- If the true label is 0, the loss is $-\log(1 - p)$. Same logic in reverse.
This loss function brutally punishes the model for being cocky and wrong, which is exactly what we want.
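Both cases fold into one expression, $-[y \log(p) + (1 - y)\log(1 - p)]$, averaged over the examples. Here is a minimal sketch of it. The `eps` clipping is an addition of mine to avoid taking `log(0)`, and the function signature is assumed.

```python
import math


def loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a list of labels and predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip p away from exactly 0 or 1 so math.log never sees 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


print(loss([1], [0.99]))  # confident and right: tiny loss
print(loss([1], [0.01]))  # confident and wrong: huge loss
```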
Step 3: The update() function (Still Gradient Descent!)
Guess what? The update mechanism is still Gradient Descent. The derivative of the Binary Cross-Entropy loss with respect to our parameters m and b turns out to be surprisingly simple.
The update loop looks identical to the one in linear regression. We just swapped out the engine parts (predict and loss functions), but the chassis is the same.
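For reference, the gradients of the mean Binary Cross-Entropy work out to $\frac{\partial L}{\partial m} = \frac{1}{n}\sum_i (p_i - y_i)\,x_i$ and $\frac{\partial L}{\partial b} = \frac{1}{n}\sum_i (p_i - y_i)$, where $p_i$ is the sigmoid output for example $i$. Below is a sketch of one update step and a training loop under those formulas. The toy data, learning rate, and step count are made-up values for illustration.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def update(xs, ys, m, b, learning_rate):
    """One gradient descent step on the mean binary cross-entropy."""
    n = len(xs)
    grad_m = 0.0
    grad_b = 0.0
    for x, y in zip(xs, ys):
        error = sigmoid(m * x + b) - y  # prediction minus label
        grad_m += error * x
        grad_b += error
    m -= learning_rate * grad_m / n
    b -= learning_rate * grad_b / n
    return m, b


# Hypothetical toy data: small tumors benign (0), large ones malignant (1).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]

m, b = 0.0, 0.0
for _ in range(5000):
    m, b = update(xs, ys, m, b, learning_rate=0.1)

print(sigmoid(m * 1.0 + b))  # small tumor: probability below 0.5
print(sigmoid(m * 6.0 + b))  # large tumor: probability above 0.5
```

Note that `error` has the same "prediction minus label" shape as in linear regression, which is why the loop feels so familiar.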
This reveals a fundamental concept in ML: algorithms are often just clever combinations of simpler pieces. We didn't learn a totally new algorithm; we learned a "plugin" (the sigmoid function and a new loss) that adapted our linear model for a new task.
Decision Boundaries: Drawing the Line (Literally)
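So what does our trained logistic regression model actually do? It learns a decision boundary: the set of points where the predicted probability is exactly 0.5, which is where z = 0. For a 2D problem (two input features), this is a line.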
The model has learned a line that best separates the 'O's from the 'X's. Any new point that falls on one side of the line is classified as 'X', and any point on the other is classified as 'O'. This is how the model makes its decision. It's not just predicting probabilities; it's literally drawing a line in the sand.
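To make the "line in the sand" concrete, here is a sketch for two input features. The weights `W1`, `W2`, and `B` are hypothetical hand-picked values, not a trained model, and mapping class 1 to 'X' is an arbitrary assumption. Since the sigmoid gives at least 0.5 exactly when z is at least 0, the classifier only needs the sign of z.

```python
# Hypothetical parameters: the boundary is the line x1 + x2 - 5 = 0.
W1, W2, B = 1.0, 1.0, -5.0


def classify(x1, x2):
    """Return 'X' (class 1) or 'O' (class 0) for a 2D point."""
    z = W1 * x1 + W2 * x2 + B
    # sigmoid(z) >= 0.5 exactly when z >= 0, so the sign of z decides.
    return "X" if z >= 0 else "O"


def boundary_x2(x1):
    """Solve W1*x1 + W2*x2 + B = 0 for x2: the point on the boundary line."""
    return -(W1 * x1 + B) / W2


print(classify(4.0, 4.0))  # X (z = 3, above the line)
print(classify(1.0, 1.0))  # O (z = -3, below the line)
print(boundary_x2(2.0))    # 3.0
```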