Spam Filter


On the previous screen, we read in the dataset and saw that about 87% of the messages are ham ("ham" means non-spam), and the remaining 13% are spam. Now that we've become more familiar with the dataset, we can move on to building the spam filter.

Our project is a machine learning problem, specifically a classification problem. The goal of our project is to maximize the predictive ability of our algorithm. This is in contrast to what we would usually do in something like hypothesis testing, where the goal is proper statistical inference.

We want to optimize our algorithm's ability to correctly classify messages that it hasn't seen before. We'll want to create a process by which we can tweak aspects of our algorithm to see what produces the best predictions. The first step toward this process is to divide our spam data into three distinct datasets:

  • A training set, which we'll use to "train" the algorithm to classify messages.
  • A cross-validation set, which we'll use to assess how different choices of α affect prediction accuracy.
  • A test set, which we'll use to test how good the spam filter is at classifying new messages.

We're going to keep 80% of our dataset for training, 10% for cross-validation and 10% for testing. We typically want to keep as much data as possible for training. The dataset has 3,184 messages, which means that:

  • The training set will have 2,547 messages
  • The cross-validation and test sets will have 318 and 319 messages respectively
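As a quick sanity check, here's a minimal sketch of how these sizes follow from the 80/10/10 split, assuming the data frame from the previous screen is named spam:

```r
n <- nrow(spam)                 # 3,184 messages in total
n_train <- floor(0.8 * n)       # 2,547 messages for training
n_cv    <- floor(0.1 * n)       # 318 messages for cross-validation
n_test  <- n - n_train - n_cv   # the remaining 319 for testing
```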

We expose the algorithm to examples of spam and ham through the training set. In other words, we develop all of the conditional probabilities and vocabulary from the training set. After this, we need to choose an α value, and the cross-validation set will help us choose the best one. Throughout this whole process, there is a set of data that the algorithm never sees: the test set. We aim to maximize prediction accuracy on the cross-validation set, since it serves as a proxy for how well the algorithm will perform on the test set.
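To make the role of the cross-validation set concrete, here's a hypothetical sketch of the tuning loop we'll eventually build; fit_spam_filter() and accuracy() are placeholder names for functions we haven't written yet, not part of any library:

```r
# Hypothetical sketch: try several smoothing values and keep the one
# that scores best on the cross-validation set
alpha_grid <- seq(0.1, 1, by = 0.1)

cv_accuracy <- sapply(alpha_grid, function(alpha) {
  model <- fit_spam_filter(spam_train, alpha = alpha)  # placeholder
  accuracy(model, spam_cv)                             # placeholder
})

best_alpha <- alpha_grid[which.max(cv_accuracy)]
```

The test set stays out of this loop entirely, so the accuracy we measure on it at the end is an honest estimate of performance on new messages.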

Let's create our training, cross-validation and test sets. If you need help, don't forget that you can find the solutions notebook at this link.

  1. Use the sample() function to generate random indices. The idea here is that we want random rows from spam. We can pick random rows by generating random indices and using them to select data from spam. (One possible approach is sketched after this list.)

    • Keep in mind the range of the random numbers you want. It won't make sense to generate a random number that is higher than the number of rows in the spam dataset.
    • We don't want duplicate indices, either within a dataset or across datasets.
    • Make sure you check the documentation for the sample() function if you're lost.
  2. Use these randomized indices to create your training, cross-validation and test sets.

    • You'll know you have it correct when the training, cross-validation and test datasets together form the original dataset.
  3. Find the percentage of spam and ham in all of the datasets. Are the percentages similar to what we have in the full dataset?
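Here's a minimal sketch of one way to do all three steps, assuming the data frame is named spam and has a label column with the values "spam" and "ham"; your solution (and the solutions notebook) may differ in the details:

```r
set.seed(1)  # fix the seed so the random split is reproducible

n <- nrow(spam)
n_train <- floor(0.8 * n)
n_cv <- floor(0.1 * n)

# Shuffle all row indices once; slicing the shuffled vector into
# disjoint pieces guarantees no duplicates within or across sets
indices <- sample(1:n, size = n, replace = FALSE)

spam_train <- spam[indices[1:n_train], ]
spam_cv    <- spam[indices[(n_train + 1):(n_train + n_cv)], ]
spam_test  <- spam[indices[(n_train + n_cv + 1):n], ]

# The three sets together should form the original dataset
stopifnot(nrow(spam_train) + nrow(spam_cv) + nrow(spam_test) == n)

# Compare the spam/ham percentages in each set with the full dataset
prop.table(table(spam_train$label))
prop.table(table(spam_cv$label))
prop.table(table(spam_test$label))
```

Because the split is random, each set's spam/ham percentages should land close to the roughly 87%/13% breakdown of the full dataset.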
