Ham vs Spam: Sorting Spam Email Using Alibaba Cloud’s Platform for AI (PAI)


By Jeremy Pedersen

Spam!? Spam.

Spam mail once seemed like an unstoppable juggernaut. When I got my first laptop in early 2004, I would get 10 or 20 spam messages a day, and there were always 2 or 3 that made it through my spam filter and into my inbox.

But over the last 20 years, spam has become almost a non-issue. When I log into my Outlook or Gmail accounts, I almost never see spam. In fact, the last time I recall seeing a genuine spam message in my Gmail inbox was more than a decade ago.

So what happened? Did the law finally catch up with spammers? Did they repent and move on to more wholesome activities? Quite the opposite: according to some estimates, spam now makes up 90% of all email sent. So what changed? Where’s all the spam?

It’s all in the numbers

What changed is statistical filtering with Naive Bayes classifiers. The idea is older than Paul Graham’s essay on Bayesian spam filtering, but he was the first person to put it into practice in such a simple, effective way.

The Naive Bayes classifier is a Machine Learning technique based on Bayes’ Theorem, an important idea from probability theory and statistics.

Bayes’ Theorem lets you predict the probability of something, given prior knowledge about conditions related to the thing you want to predict. What does this have to do with spam? Bayes’ Theorem can help you calculate the probability that a message is spam, based on the words contained in the message.
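To make that concrete, here is a minimal sketch of Bayes’ Theorem applied to a single word. The function name and all of the numbers are made up for illustration, and the prior of 0.5 simply assumes spam and real mail are equally likely before any words are seen.

```python
def spam_given_word(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Bayes' Theorem: P(spam | word) from the two likelihoods and the prior."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    # Total probability of seeing the word at all, in spam or in real mail.
    evidence = numerator + p_word_given_ham * p_ham
    return numerator / evidence

# Hypothetical numbers: "winner" shows up in 20% of spam and 1% of real mail.
print(round(spam_given_word(0.20, 0.01), 3))  # 0.952
```

A word that is twenty times more common in spam than in real mail pushes the probability of spam from 50% to about 95%.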

The Naive Bayes approach is a simple one. You start with a collection of known spam and non-spam messages. This is your training data. You count up all the words you find in each of these two sets of mail. You can then assign a score (a probability) to each word, based on how often it appears in spam and non-spam emails.
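The counting step might look something like the sketch below. The tiny training set, the whitespace tokenizer, and the add-one (Laplace) smoothing are all assumptions made for this example; smoothing is included so that a word never seen in one class does not end up with a probability of exactly zero.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would do more."""
    return text.lower().split()

def word_probabilities(messages, vocabulary):
    """Estimate P(word | class) with add-one (Laplace) smoothing."""
    counts = Counter(word for msg in messages for word in tokenize(msg))
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocabulary)) for w in vocabulary}

# Tiny made-up training set, for illustration only.
spam = ["win cash now", "cheap cash prize"]
ham = ["lunch at noon", "see you at lunch now"]

vocabulary = {w for msg in spam + ham for w in tokenize(msg)}
p_word_spam = word_probabilities(spam, vocabulary)
p_word_ham = word_probabilities(ham, vocabulary)

# "cash" only appears in the spam examples, so it scores higher there.
print(p_word_spam["cash"] > p_word_ham["cash"])  # True
```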

Now, when you receive new email, you can split the new email into its components (words), and calculate, using Bayes’ Theorem, how likely it is that a message is spam, given that you have seen a particular word.

By combining the scores for all the words in the email (multiplying the per-word probabilities or, equivalently, summing their logarithms), we get an overall spam probability score that tells us whether or not we should throw the email in our Junk folder.
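One common way to combine per-word scores is to add up log-probabilities for each class and then compare the two totals. The sketch below does this with hypothetical, hard-coded word probabilities; the table values, the function name, and the choice to skip unknown words are assumptions for illustration, not the method used in the notebook.

```python
import math

# Hypothetical per-word probabilities, as if learned from training mail.
P_WORD_SPAM = {"cash": 0.20, "prize": 0.15, "lunch": 0.01, "meeting": 0.01}
P_WORD_HAM = {"cash": 0.02, "prize": 0.01, "lunch": 0.15, "meeting": 0.20}

def spam_probability(text, p_spam=0.5):
    """Combine per-word evidence in log space and return P(spam | message)."""
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for word in text.lower().split():
        if word in P_WORD_SPAM:  # words never seen in training are skipped
            log_spam += math.log(P_WORD_SPAM[word])
            log_ham += math.log(P_WORD_HAM[word])
    # Convert the two log scores back into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

print(spam_probability("cash prize inside") > 0.5)    # True
print(spam_probability("lunch meeting today") > 0.5)  # False
```

Working with logarithms avoids multiplying many small numbers together, which can underflow to zero on long messages.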

The amazing thing is just how well this works. Even if you assume that all the words in the email are independent of one another (they aren’t: the English language does have grammatical rules, after all) you can still sort spam with amazing accuracy. Even better, the false positive rate (the rate at which real email is mis-identified as spam) is very low.
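The false positive rate is easy to measure once you have a set of messages with known labels. Here is a small sketch; the function name and the sample labels are made up for illustration.

```python
def false_positive_rate(actual, predicted):
    """Share of real (ham) messages that were wrongly flagged as spam."""
    ham_predictions = [p for a, p in zip(actual, predicted) if a == "ham"]
    if not ham_predictions:
        return 0.0
    return ham_predictions.count("spam") / len(ham_predictions)

# Four real messages, one of which was wrongly sent to the Junk folder.
actual = ["ham", "ham", "ham", "ham", "spam", "spam"]
predicted = ["ham", "ham", "spam", "ham", "spam", "ham"]
print(false_positive_rate(actual, predicted))  # 0.25
```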

The best part is that the system can learn. Each spam message that gets through the filter can be marked as spam by you, the user: these new spam messages then get added to the model, so future spam doesn’t get through to your inbox.
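Because the model is built from word counts, learning from a newly flagged message can be as simple as updating those counts. The class below is a hypothetical sketch of that idea; its `spam_ratio` score is a deliberately simplified stand-in for the full Bayes calculation.

```python
from collections import Counter

class WordCounts:
    """Running word counts for spam and ham that can be updated at any time."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, text, label):
        """Add one labeled message to the counts."""
        self.counts[label].update(text.lower().split())

    def spam_ratio(self, word):
        """Fraction of this word's sightings that were in spam (0.5 if unseen)."""
        in_spam = self.counts["spam"][word]
        in_ham = self.counts["ham"][word]
        total = in_spam + in_ham
        return in_spam / total if total else 0.5

model = WordCounts()
model.learn("team lunch on friday", "ham")
before = model.spam_ratio("crypto")

# The user marks a message that slipped through as spam.
model.learn("crypto giveaway friday", "spam")
after = model.spam_ratio("crypto")
print(before, after)  # 0.5 1.0
```

No retraining from scratch is needed: each correction from the user only touches the counts for the words in that one message.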

So what happened to all the spam? Machine Learning happened! Thanks to Naive Bayes classifiers, most modern email users simply don’t see much spam: most of it is caught by modern spam filters, which are constantly learning to identify even the most sophisticated spam messages.

Building a classifier

The Alibaba Cloud Academy has a GitHub page where we explain how to do this using Alibaba Cloud’s Platform for AI (PAI). Specifically, we use PAI DSW (Data Science Workshop), a web console that gives you easy access to a project space with 5 GB of free storage and a Jupyter Notebook interface, so you can make use of powerful hardware to run models using popular tools like TensorFlow.

All the code and instructions you need to try building your own spam filter on Alibaba Cloud can be found right here.

The steps are simple:

  1. Sign up for an Alibaba Cloud account
  2. Download the Jupyter Notebook code and Kaggle Dataset by following these instructions
  3. Enable PAI and set up a PAI DSW notebook (a small notebook instance is fine, no need for a lot of RAM or a GPU)
  4. Upload the Jupyter notebook file (.ipynb file) and email data
  5. Run the code in the notebook

That’s it! Make sure to read through the code, so you can get a feel for what it’s doing. There’s more detail on the GitHub page (see links above) if you are interested in what libraries are used.

