Data Preprocessing for Machine Learning

By Harshit Khandelwal, Alibaba Cloud Community Blog author.

Data in the real world can be messy or incomplete, often riddled with errors. Finding insights or solving problems with such data can be extremely tedious and difficult. Therefore, it is important that data is converted into a format that is understandable before it can be processed. This is essentially what data preprocessing is. It is an important piece of the puzzle because it helps in removing various kinds of inconsistencies from the data.

Some of the ways that data can be handled during preprocessing include:

  • Data cleansing: Inconsistent, duplicate, or incorrect data is corrected or removed from the data source.
  • Data editing: The data source is edited to get rid of problems such as redundant, wrong, or missing data.
  • Data integration: Data from different sources is combined into one source. Sources can be a database, a data cube, or a flat file.
  • Data wrangling: Also known as data munging. The data is transformed or mapped from its raw format into a format that is more suitable for the analysis that follows.

This tutorial will discuss the preprocessing aspect of the machine learning pipeline, covering the ways in which data is handled and paying special attention to the techniques of standardization, normalization, feature selection and extraction, and data reduction. Finally, this tutorial goes over how you can implement these techniques in a simple way.

This is part of a series of blog tutorials that discuss the basics of machine learning.



Feature Selection and Extraction

The techniques of extraction and selection are quite different from each other. Extraction creates new features from existing features, whereas selection returns a subset of the most relevant features from a larger set. Selection is generally used in scenarios where there are many features, whereas extraction is generally used when there are few features. A very important use of feature selection is in text analysis, where there are many features but very few samples (the samples being the records themselves), and an important use of feature extraction is in image processing.
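To make the difference concrete, here is a minimal sketch in plain Python. The records, the column names, the zero-variance filter, and the derived BMI feature are all made up for illustration; they are just one simple way to show "keep a subset" versus "build something new".

```python
from statistics import pvariance

# Hypothetical toy records; the column names are illustrative only.
rows = [
    {"height_m": 1.60, "weight_kg": 64.0, "country": 1},
    {"height_m": 1.80, "weight_kg": 81.0, "country": 1},
    {"height_m": 1.70, "weight_kg": 57.8, "country": 1},
]

def select_by_variance(rows, threshold=0.0):
    """Feature selection: keep only columns whose variance exceeds the threshold."""
    names = list(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) > threshold]

def extract_bmi(row):
    """Feature extraction: derive a new feature from two existing ones."""
    return row["weight_kg"] / row["height_m"] ** 2

# "country" never changes, so it carries no information and is dropped.
selected = select_by_variance(rows)

# A brand-new column computed from height and weight.
bmi = [round(extract_bmi(r), 1) for r in rows]

print(selected)
print(bmi)
```

Selection only ever returns columns that already exist, while extraction produces a column that was not in the data before.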

These techniques are mostly used because they offer the following benefits:

  • Help to simplify the model
  • Require shorter training time
  • Make analysis easier

Data Reduction

Data reduction is often thought to have a negative impact on the final results of machine learning algorithms, but this is simply not true. The process of data reduction does not simply remove data; rather, it converts the data from one representation to another for better performance. The following are some ways in which data is reduced:

  • Binning: Creates bins (or intervals) from the values of an attribute by grouping them according to a common criterion.
  • Clustering: Groups values into clusters.
  • Aggregation or generalization: Searches, gathers, and presents data in a summarized format.
  • Sampling: Selects a representative subset of the data to identify patterns and trends in the larger data set.
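Three of the techniques in this list (binning, aggregation, and sampling) can be sketched in a few lines of plain Python. The ages, bin edges, and labels below are hypothetical and chosen only to keep the example small.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical values of a single numeric attribute.
ages = [3, 17, 25, 38, 41, 59, 64, 72]

def bin_value(value, edges, labels):
    """Binning: return the label of the interval [lo, hi) that contains value."""
    for lo, hi, label in zip(edges, edges[1:], labels):
        if lo <= value < hi:
            return label
    raise ValueError(f"{value} is outside the bin edges")

edges = [0, 18, 65, 120]               # illustrative interval boundaries
labels = ["minor", "adult", "senior"]
binned = [bin_value(a, edges, labels) for a in ages]

# Aggregation: summarize the raw values as one mean per bin.
groups = defaultdict(list)
for age, label in zip(ages, binned):
    groups[label].append(age)
summary = {label: mean(values) for label, values in groups.items()}

# Sampling: draw a random subset without replacement (seeded for repeatability).
rng = random.Random(0)
sample = rng.sample(ages, k=4)

print(binned)
print(summary)
print(sample)
```

Each step leaves you with less to process: eight numbers become three labels, three means, or four sampled values.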

How to Use Preprocessing Techniques

Several rescaling and feature manipulation methods can be applied directly to each record of the data, but applying them manually is tiring and extremely difficult. So now let’s have a look at how these methods work in action, so that you can have an easier and simpler workflow.

First, let’s start by getting a dataset that’s already present in the cloud. The data used in this example is farmer data:

So the data is as follows:

Next, you can use the normalization node, which is under the category Data Preprocessing:

After applying normalization, the data gets converted as shown below:
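If you want to see what such a step computes, here is a minimal sketch. It assumes that normalization means min-max scaling onto the range [0, 1], which is a common definition but is not confirmed here for the node itself. The function name and the numbers are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical numeric column.
farm_size = [10.0, 20.0, 40.0, 50.0]
normalized = min_max_normalize(farm_size)

print(normalized)  # [0.0, 0.25, 0.75, 1.0]
```

The order and relative spacing of the values are preserved; only the scale changes.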

So let’s see what kind of results standardization produces:

The results are as follows:

The difference between this output and the normalization output can be clearly seen: the standardized data is much more regular.
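For comparison, here is a minimal sketch of standardization. It assumes the usual z-score definition (subtract the mean, divide by the standard deviation), and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column cannot be scaled; return zeros by convention.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical numeric column with mean 5 and standard deviation 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = standardize(values)

print(standardized)
```

Unlike min-max scaling, the result is not confined to [0, 1]. The values are centered on 0, and each one tells you how many standard deviations it sits from the mean.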

So now let’s move on to one of the most useful feature extraction techniques, which is one-hot encoding. This technique creates a separate column for each category in the data (it only works for categorical data). It does this through binary encoding when creating each new column, meaning that it puts 1 where the value exists and 0 elsewhere.

The data produced is as follows:
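The same idea can be sketched in plain Python. The helper name and the crop values below are hypothetical; the point is only to show the one-column-per-category layout.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Hypothetical categorical column.
crops = ["rice", "wheat", "rice", "maize"]
categories, encoded = one_hot_encode(crops)

print(categories)  # ['maize', 'rice', 'wheat']
for crop, row in zip(crops, encoded):
    print(crop, row)
```

Every row contains exactly one 1, in the column that matches its category.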

Next, you want to select features for the model you want to develop. Below, a linear regression model is used to generate feature importance (that is, to determine which features are most important). It gives a value for each column that indicates how much that column contributes towards the accuracy of the model.

The result we got from linear model feature importance is depicted in the histogram below:
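How the importance values are computed is not spelled out here, so the following is only a rough sketch of one common approach: standardize the features, fit a linear model, and use the absolute value of each weight as that feature's importance. The data, the gradient-descent fit, and all names are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical toy data: the target is built as 3*x1 + 0.5*x2, and x3 is irrelevant.
names = ["x1", "x2", "x3"]
X = [[1, 3, 5], [2, 1, 5], [3, 2, 1], [4, 3, 1], [5, 1, 3], [6, 2, 3]]
y = [3 * x1 + 0.5 * x2 for x1, x2, _ in X]

def standardize_columns(X):
    """Z-score every column so the fitted weights are comparable across features."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in X]

def fit_linear(X, y, lr=0.05, epochs=2000):
    """Fit y ~ b + w.x by batch gradient descent on the mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [b + sum(wj * xj for wj, xj in zip(w, row)) - yi
                  for row, yi in zip(X, y)]
        b -= lr * 2 * sum(errors) / n
        for j in range(d):
            grad = 2 * sum(e * row[j] for e, row in zip(errors, X)) / n
            w[j] -= lr * grad
    return w, b

weights, bias = fit_linear(standardize_columns(X), y)

# Importance (assumed definition): magnitude of each standardized weight.
importance = {name: abs(w) for name, w in zip(names, weights)}
ranking = sorted(names, key=lambda n: importance[n], reverse=True)

for name in ranking:
    print(name, round(importance[name], 3))
```

Because the target was built from x1 and x2 only, x1 should rank first and x3 should come out with an importance close to zero. Features with near-zero importance are natural candidates to drop.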

Key Takeaways
