Ham vs Spam: Sorting Spam Email Using Alibaba Cloud’s Platform for AI (PAI)
By Jeremy Pedersen
We’ve all received spam mail before. These annoying, sometimes dangerous messages can come from a variety of sources: scammers trying to steal your money, unscrupulous companies trying to push their products, even criminals trying to harvest email addresses or coax your computer into running viruses or malware.
Spam mail once seemed like an unstoppable juggernaut. When I got my first laptop in early 2004, I would get 10 or 20 spam messages a day, and there were always 2 or 3 that made it through my spam filter and into my inbox.
But over the last 20 years, spam has become almost a non-issue. When I log into my Outlook or Gmail accounts, I almost never see spam. In fact, the last time I recall seeing a genuine spam message in my Gmail inbox was more than a decade ago.
So what happened? Did the law finally catch up with spammers? Did they repent and move on to more wholesome activities? Quite the opposite: according to some estimates, spam now makes up 90% of all email sent. So what changed? Where’s all the spam?
It’s all in the numbers
In August 2002, Paul Graham published an essay on his personal website, called A Plan for Spam. In it he described how he used a Naive Bayes classifier to detect and block spam.
The idea is older than Paul Graham’s essay, but he was the first person to put it into practice in such a simple, effective way.
The Naive Bayes classifier is a Machine Learning technique based on Bayes’ Theorem, an important idea from probability theory and statistics.
Bayes’ Theorem lets you calculate the probability of something, given prior knowledge about conditions related to the thing you want to predict. What does this have to do with spam? Bayes’ Theorem can help you calculate the probability that a message is spam, based on the words contained in the message.
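To make this concrete, here is a minimal sketch of Bayes’ Theorem applied to a single word. The function name and all of the numbers are made up for illustration; they are not taken from any real dataset.

```python
def posterior_spam_given_word(p_word_given_spam, p_word_given_ham, p_spam):
    """Bayes' Theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any message, spam or not.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Illustrative assumptions: "prize" appears in 20% of spam and 1% of
# legitimate mail, and half of all incoming mail is spam.
p = posterior_spam_given_word(0.20, 0.01, 0.5)
print(round(p, 3))  # 0.952
```

With these assumed numbers, seeing the word "prize" alone pushes the spam probability from 50% to about 95%.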
The Naive Bayes approach is a simple one. You start with a collection of known spam and non-spam messages. This is your training data. You count up all the words you find in each of these two sets of mail. You can then assign a score (a probability) to each word, based on how often it appears in spam and non-spam emails.
Now, when you receive new email, you can split the new email into its components (words), and calculate, using Bayes’ Theorem, how likely it is that a message is spam, given that you have seen a particular word.
By combining the scores for all the words in the email (in practice, by multiplying the per-word probabilities, or equivalently summing their logarithms), we get an overall spam probability score that tells us whether or not we should throw the email in our Junk folder.
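The training and scoring steps described above can be sketched in a few lines of Python. This is a toy illustration, not the code from the GitHub page: the function names, the four training messages, the whitespace tokenizer, and the add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class. messages: iterable of (text, label) pairs,
    where label is "spam" or "ham"."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for text, label in messages:
        message_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, message_counts


def spam_probability(text, word_counts, message_counts):
    """Return P(spam | words), treating words as independent of one another."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(message_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        # Start from the prior: how common this class is overall.
        log_score = math.log(message_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            # Summing logs is the same as multiplying probabilities.
            log_score += math.log(likelihood)
        log_scores[label] = log_score
    # Convert the two log scores back into a normalized probability.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Tiny made-up training set, for illustration only.
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday", "ham"),
]
counts, totals = train(training)
print(spam_probability("free prize", counts, totals) > 0.5)      # True
print(spam_probability("monday meeting", counts, totals) > 0.5)  # False
```

Working with log-probabilities avoids numerical underflow when a message contains many words, which is why the sketch adds logarithms rather than multiplying raw probabilities.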
The amazing thing is just how well this works. Even if you assume that all the words in the email are independent of one another (they aren’t: the English language does have grammatical rules, after all) you can still sort spam with amazing accuracy. Even better, the false positive rate (the rate at which real email is mis-identified as spam) is very low.
The best part about this is that this system can learn. Each spam message that gets through the filter can be marked as spam by you, the user: these new spam messages then get added to the model, so future spam doesn’t get through to your inbox.
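As a rough sketch of how this feedback loop might work, the example below keeps running word counts and updates them when a user marks a message. The class name `FeedbackCounts` and its `spam_ratio` score are hypothetical and deliberately crude; a real filter would feed the updated counts into the full Naive Bayes calculation.

```python
from collections import Counter


class FeedbackCounts:
    """Hypothetical container for per-class word counts that grows with user feedback."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}

    def learn(self, text, label):
        # Fold a newly labeled message into the running counts.
        self.words[label].update(text.lower().split())

    def spam_ratio(self, word):
        # Smoothed share of this word's occurrences that were seen in spam.
        spam_seen = self.words["spam"][word] + 1
        ham_seen = self.words["ham"][word] + 1
        return spam_seen / (spam_seen + ham_seen)


model = FeedbackCounts()
model.learn("quarterly invoice attached", "ham")
before = model.spam_ratio("invoice")

# The user marks a message that slipped through the filter as spam.
model.learn("urgent invoice overdue click here", "spam")
after = model.spam_ratio("invoice")

print(before < after)  # True: "invoice" now looks more spam-like than before
```

Because the model is just a set of counts, learning from a new message is cheap: no retraining from scratch is needed.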
So what happened to all the spam? Machine Learning happened! Thanks to Naive Bayes classifiers, most modern email users simply don’t see much spam: most of it is caught by modern spam filters, which are constantly learning to identify even the most sophisticated spam messages.
Building a classifier
You can try this yourself. Publicly available email datasets are easy to find on sites like Kaggle, and most of the email preprocessing (opening the emails, splitting them into word sets, and so on) can be done with existing Python libraries.
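As an example of the preprocessing step, here is a minimal sketch that uses only Python's standard library to open a raw email and split it into a word set. The function name `email_to_words`, the sample message, and the simple regular-expression tokenizer are assumptions for illustration; the notebook on the GitHub page may do this differently, and a given dataset may store messages in another format (such as CSV).

```python
import re
from email import message_from_string
from email.policy import default


def email_to_words(raw_email):
    """Parse a raw email and return the set of lowercase words
    found in its subject line and plain-text body."""
    msg = message_from_string(raw_email, policy=default)
    # Prefer the plain-text part; HTML-only messages yield an empty body here.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    text = f"{msg['subject'] or ''} {body}"
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


# Made-up sample message.
raw = (
    "From: promo@example.com\n"
    "To: you@example.com\n"
    "Subject: Claim your FREE prize\n"
    "\n"
    "Click here to claim your prize today!\n"
)
words = email_to_words(raw)
print(sorted(words))
```

Using a set discards how often each word appears in a single message; keep a list or a `Counter` instead if your classifier needs per-message word frequencies.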
The Alibaba Cloud Academy has a GitHub page where we explain how to do this using Alibaba Cloud’s Platform for AI (PAI). Specifically, we use PAI DSW (Data Science Workshop), a web console that gives you easy access to a project space with 5 GB of free storage and a Jupyter Notebook interface that lets you run models on powerful hardware using popular tools like TensorFlow.
All the code and instructions you need to try building your own spam filter on Alibaba Cloud can be found right here.
The steps are simple:
- Sign up for an Alibaba Cloud account
- Download the Jupyter Notebook code and Kaggle Dataset by following these instructions
- Enable PAI and set up a PAI DSW notebook (a small notebook instance is fine, no need for a lot of RAM or a GPU)
- Upload the Jupyter notebook file (.ipynb file) and email data
- Run the code in the notebook
That’s it! Make sure to read through the code, so you can get a feel for what it’s doing. There’s more detail on the GitHub page (see links above) if you are interested in what libraries are used.
Interested in learning more? Check out these courses on the Alibaba Cloud Academy:
- Machine Learning Algorithm Primer Series 2- Naive Bayes Classifier
- Support Vector Machine Implementation Through PAI