Spotify popularity, tweet sentiment, mood-based snacks.
Alright, enough theory. You've built the engines, you understand the mechanics, you've stared into the mathematical abyss without blinking. Time to build cool shit.
In this chapter, we're taking our beautiful, hand-crafted, from-scratch models out for a spin. We're going to point them at real-world (and slightly ridiculous) problems and make them do something. This is where the rubber meets the road. Or, more accurately, where the predict() function meets the CSV file.
We'll walk through a few mini-projects, covering the essential pipeline: loading data, a bit of cleaning, training our DIY models, and trying to make sense of the results.
Project 1: Predicting Spotify Song Popularity
- The Goal: Can we predict how popular a song will be?
- The Dataset: We'll use a Spotify dataset from Kaggle covering thousands of songs. Each song has audio features (danceability, energy, loudness) plus a popularity score from 0 to 100.
- Our Weapon of Choice: Popularity is continuous → classic regression. Use the DIY Linear Regression from Chapter 4.
- The Workflow:
  - Load Data: pandas to load `spotify_songs.csv`.
  - Feature Selection: Pick intuitive features like `danceability` and `energy`. X = these, y = `popularity`.
  - Train the Model: Feed X and y into the gradient-descent loop. It finds the best weights and bias.
  - Interpret the Results: Our trained model gives an equation like `popularity = w₁·dance + w₂·energy + b`. If `w₁` is large and positive, the model has learned that more danceability correlates with more popularity.
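The workflow above can be sketched in a few lines of plain Python. This is a stand-in, not the Chapter 4 code: the function name `train_linear_regression`, the learning rate, and the epoch count are all assumptions, and the data is synthetic (generated from a known equation plus noise) so the block runs without the Kaggle file. In the real project the rows would come from something like `pandas.read_csv("spotify_songs.csv")`.

```python
import random

def train_linear_regression(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on mean squared error (hypothetical helper)."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        # Prediction error for every row under the current parameters.
        errors = [sum(w * x for w, x in zip(weights, row)) + bias - target
                  for row, target in zip(X, y)]
        # Gradients of MSE with respect to each weight and the bias.
        grad_w = [2.0 / n_samples * sum(e * row[j] for e, row in zip(errors, X))
                  for j in range(n_features)]
        grad_b = 2.0 / n_samples * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Synthetic stand-in for the Spotify data (made up, not real results):
# popularity = 40*danceability + 20*energy + 10, plus a little noise.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [40 * d + 20 * e + 10 + rng.gauss(0, 1) for d, e in X]

weights, bias = train_linear_regression(X, y)
print(f"popularity = {weights[0]:.1f}*dance + {weights[1]:.1f}*energy + {bias:.1f}")
```

Because we planted the equation ourselves, the learned weights should land close to 40 and 20. On the real dataset there is no planted answer, so the weights only tell you which features the model leans on, not what causes a hit.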
Project 2: Is This Tweet Trash or Fire? (Sentiment Analysis)
- The Goal: Classify a tweet's sentiment as 'positive' or 'negative'.
- The Dataset: A simple Twitter sentiment dataset.
- Our Weapon of Choice: Text classification → DIY Naive Bayes from Chapter 8.
- The Workflow:
  - Load Data: CSV of tweets and labels.
  - Text Preprocessing:
    - Convert all text to lowercase.
    - Remove punctuation and URLs.
    - Tokenize (split into words).
  - Train the Model: Count word frequencies per class, compute priors, likelihoods. Just like the spam filter.
  - Test It Out: "this movie was absolutely incredible" → should predict positive. "I would rather watch paint dry" → negative.
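Here is a minimal sketch of that workflow, assuming a multinomial Naive Bayes with Laplace (add-one) smoothing. It is not the Chapter 8 implementation: the helper names (`preprocess`, `train_naive_bayes`, `predict`) are hypothetical, and the six "tweets" are invented so the block runs without a dataset.

```python
import math
import re
from collections import Counter, defaultdict

def preprocess(text):
    """Lowercase, strip URLs and punctuation, split into word tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs first, before punctuation
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # then everything non-alphanumeric
    return text.split()

def train_naive_bayes(texts, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(preprocess(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for word in preprocess(text):
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy data, just enough to exercise the code.
texts = [
    "Absolutely incredible movie, loved it!",
    "What a great day, feeling incredible",
    "This was so good https://example.com/clip",
    "I would rather do anything else, so boring",
    "Worst movie ever, watching paint dry is better",
    "This was awful and boring",
]
labels = ["positive"] * 3 + ["negative"] * 3

model = train_naive_bayes(texts, labels)
print(predict("this movie was absolutely incredible", *model))
print(predict("I would rather watch paint dry", *model))
```

Working in log space avoids multiplying many tiny probabilities together, which would underflow on longer texts.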
Project 3: Recommending Snacks Based on Mood (No, Seriously)
- The Goal: Highly scientific snack recommendation for your current mood.
- The Dataset: We invent this one. ML works on anything.
| Mood | Weather | Time of Day | Snack |
|---|---|---|---|
| Stressed | Rainy | Evening | Chocolate |
| Bored | Sunny | Afternoon | Chips |
| Happy | Sunny | Morning | Fruit |
| Stressed | Sunny | Afternoon | Chocolate |
| Tired | Rainy | Evening | Ice Cream |
| Bored | Cloudy | Evening | Popcorn |
- Our Weapon of Choice: Multi-class classification with categorical features → DIY KNN from Chapter 7.
- The Workflow:
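  - Data Prep: Encode categories to numbers. Plain integer codes (stressed=0, bored=1, …) work, but they imply an order and spacing between categories that doesn't exist, so which row counts as "closest" can depend on the arbitrary codes. One-hot encoding, or simply counting mismatched features, avoids that.
  - "Train" the Model: KNN just memorizes the dataset.
  - Make a Prediction: Mood=stressed, weather=cloudy, time=evening → numbers → k=1 → find the closest row. Counting mismatched features, (Stressed, Rainy, Evening) and (Bored, Cloudy, Evening) tie at one mismatch each. Breaking the tie by table order picks (Stressed, Rainy, Evening) → recommend Chocolate. Science!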
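A sketch of the snack recommender, assuming one-hot encoding with squared Euclidean distance rather than plain integer codes, so that no category is accidentally "closer" to another. This is not the Chapter 7 code: `one_hot`, `squared_distance`, and `knn_predict` are hypothetical names, and ties are broken by table order.

```python
from collections import Counter

# The invented table from above: (mood, weather, time of day) -> snack.
ROWS = [
    (("Stressed", "Rainy", "Evening"), "Chocolate"),
    (("Bored", "Sunny", "Afternoon"), "Chips"),
    (("Happy", "Sunny", "Morning"), "Fruit"),
    (("Stressed", "Sunny", "Afternoon"), "Chocolate"),
    (("Tired", "Rainy", "Evening"), "Ice Cream"),
    (("Bored", "Cloudy", "Evening"), "Popcorn"),
]

# One sorted category list per column, used to build one-hot vectors.
CATEGORIES = [sorted({features[i] for features, _ in ROWS}) for i in range(3)]

def one_hot(features):
    """Concatenate a 0/1 indicator vector for each categorical column."""
    vector = []
    for value, options in zip(features, CATEGORIES):
        vector.extend(1 if value == option else 0 for option in options)
    return vector

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(query, k=1):
    """Majority vote among the k nearest rows; ties keep table order."""
    q = one_hot(query)
    # sorted() is stable, so equally distant rows stay in table order.
    ranked = sorted(ROWS, key=lambda row: squared_distance(q, one_hot(row[0])))
    votes = Counter(snack for _, snack in ranked[:k])
    return votes.most_common(1)[0][0]

# Prints Chocolate: the Stressed/Rainy/Evening and Bored/Cloudy/Evening rows
# are equally close, and the first one in the table wins the tie.
print(knn_predict(("Stressed", "Cloudy", "Evening")))
```

With only six rows, the tie-break rule decides the answer as often as the data does. That is worth remembering before trusting any snack-related output.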
Building Your First ML Pipeline
As we work through these projects, the same pattern keeps repeating: load the data, clean it, train a model, make predictions. We'll formalize it by creating a simple Python class that represents our first ML pipeline.
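One way such a class might look is sketched below. The name `SimplePipeline`, the `fit`/`predict` interface, and the placeholder `MajorityClassModel` are assumptions made for illustration. Any of the DIY models could be dropped in, provided it exposes the same two methods.

```python
class SimplePipeline:
    """Chains preprocessing steps in front of a model with fit/predict."""

    def __init__(self, steps, model):
        self.steps = steps    # list of functions: rows -> transformed rows
        self.model = model    # any object exposing fit(X, y) and predict(X)

    def _transform(self, X):
        # Apply every preprocessing step in order.
        for step in self.steps:
            X = step(X)
        return X

    def fit(self, X, y):
        self.model.fit(self._transform(X), y)
        return self

    def predict(self, X):
        # New data goes through the same steps as the training data.
        return self.model.predict(self._transform(X))


class MajorityClassModel:
    """Placeholder model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def lowercase(rows):
    return [row.lower() for row in rows]


pipeline = SimplePipeline(steps=[lowercase], model=MajorityClassModel())
pipeline.fit(["Great!", "Awful", "Loved it"], ["positive", "negative", "positive"])
print(pipeline.predict(["Whatever"]))  # prints ['positive']
```

The point of the wrapper is that prediction-time data goes through exactly the same preprocessing as training data. Forgetting that is one of the easiest ways to break a working model.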
Taken together, these projects demonstrate a crucial real-world concept: the choice of algorithm is driven by the problem you're trying to solve and the kind of data you have. There is no single "best" algorithm. A linear model is a natural first choice for continuous outputs, Naive Bayes is a strong baseline for text, and KNN can be surprisingly effective for simple, low-dimensional classification. Moving from knowing how an algorithm works to knowing when to use it is the leap from being a student to being a practitioner.