Data Science – Data Preprocessing
This article officially marks my entry into the fray of machine learning a.k.a. data science. Thanks to Udemy’s Machine Learning A-Z course, I have begun a smooth and very enjoyable journey into an area I always thought was unattainable for me. So I heartily recommend it.
In this article I aim to explain the first step of any machine learning process: preprocessing our data so it is suitable for our model. To give a brief overview:
- The missing data problem
- Categorical (non-numeric) data
- Training and test sets
- Feature scaling
The Missing Data Problem
The missing data problem is something we often encounter in real life. The datasets we use while practicing rarely have this problem: everything is filled in, every column has data that can be used. In real life, however, we often meet columns with no value at all. What are we going to do then? Before taking the course I would have said, “Get rid of those observations, who cares about them?” But the course showed me my error, and now I know better.
We can do three things with those missing values:
- Remove them (really not recommended, but still an option)
- Fill them with the mean (average) value
- Fill them with the median or most frequent value
I think filling them with the mean value is the most applicable choice for our project, because in the final analysis we aim to teach an artificial intelligence to predict a dependent value using the pool of independent values, and filling the gaps with the column mean won’t affect our predictions in the long run.
How do we do it? We could open Excel, calculate the averages, and fill in those values by hand, but if our dataset consists of a million rows, how would we go about it? Luckily for us, Python’s sklearn library offers a quick solution to that problem. You use it like this:
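(What follows is a minimal sketch of that snippet, assuming x is our independent variable matrix already loaded from the dataset; the Imputer class here belongs to older sklearn versions, which have since replaced it with SimpleImputer.)

```python
from sklearn.preprocessing import Imputer

# Look for NaN entries and replace them with the mean of
# their column (axis=0 means "work along columns")
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(x[:, 1:3])          # learn the column means
x[:, 1:3] = imputer.transform(x[:, 1:3])  # fill in the NaNs
```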
In this example, x is our independent value matrix extracted from the dataset. Imputer fills those pesky empty NaN values with the mean, with impunity. Let me explain the code a little. First we import the Imputer class from the sklearn.preprocessing module and instantiate it into the imputer variable. The missing_values parameter directs its focus to NaN values, strategy tells it how to fill them, and axis=0 makes it work along columns (axis=1 would use rows, which is not very logical for our example).
Then we fit the imputer to those values in our dataset and transform them. We assume that our missing values are in the second and third columns: the first index, which is :, means “take all the rows”, and the second index, the slice 1:3, says “take only the second and third columns”. Because a slice excludes its final index, 1:3 selects exactly the second and third columns.
By doing this we fill our NaN values with their respective column mean value. Yay!
Encoding Categorical Data
Our second problem is data containing non-numeric values, say country names or product names. We need to convert them into numerical data. Python offers a solution to that as well; let’s look at it and discuss some of this method’s problems:
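(Again a sketch of the snippet in question, assuming the categorical data sits in the first column of our independent variable matrix x.)

```python
from sklearn.preprocessing import LabelEncoder

labelencoder_x = LabelEncoder()
# Replace the text labels in the first column (all rows)
# with integer codes
x[:, 0] = labelencoder_x.fit_transform(x[:, 0])
```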
Again, our sklearn library comes to the rescue! This time, however, we use the LabelEncoder class. It takes the categorical data in all rows of the first column, denoted by x[:, 0], and transforms each label into a numerical value. Say our first column has these values: Laptop, Television, and Phone. The transformation converts them into 0, 1, and 2 and replaces the labels with those numbers.
But here is a problem. How do we tell our machine learning model that these numbers carry no order of importance? In our model they are just labels. So what we need is a matrix: for Laptop it should say [1,0,0], for Television [0,1,0], and for Phone it should state [0,0,1]. These are called dummy variables, by the way.
Have no fear Python is here:
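(A sketch of the one-hot step; note that the categorical_features parameter comes from older sklearn versions, which have since replaced it with ColumnTransformer.)

```python
from sklearn.preprocessing import OneHotEncoder

# categorical_features=[0] marks the first column as the
# categorical one to be expanded into dummy variables
onehotencoder = OneHotEncoder(categorical_features=[0])
x = onehotencoder.fit_transform(x).toarray()  # back to a dense array
```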
Again, sklearn to the rescue. This time we have the OneHotEncoder. This baby turns those labels into the dummy variables I mentioned. By passing categorical_features we say that our categories live in the first column. The first column of what, you ask? We answer that question in the last line: we pass in x, our independent value pool, and convert the result back into an array. Easy peasy lemon squeezy.
Training and Test Sets
For our baby artificial intelligence to learn the connections between the independent and dependent variables, we need two sets of data. The first is the training set, which will be used to teach our machine to discern the connections; then we use the test set to check how accurate it is. Once we accomplish that, we can use our model to predict unknown dependent variables from the independent value pool. The better it learns from the training set, the better it will predict.
The usual split used here is 80/20. To be specific, we use eighty percent of our dataset as the training set and the remaining twenty percent to test its accuracy. How do we do that? If you said sklearn, you get a gold star:
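(A sketch of the split, with x and y standing for our independent and dependent value pools as above.)

```python
from sklearn.model_selection import train_test_split

# 80/20 split; random_state is the seed, so two people using
# the same value get identical training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
```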
This code is fantastic. train_test_split takes a few arguments: x is our independent value pool, y is our dependent value pool, test_size is how much data we set aside for the test set, and random_state is the seed value. So if you are working with a friend and want to have the same training and test sets, this value must be equal in both of your scripts. The function returns a tuple, which is unpacked into four variables as you can see.
Feature Scaling
So far, so good. This is a part I don’t understand as well, so be wary. The problem we are facing is that some independent variable columns have much larger magnitudes than their counterparts (like salary versus age), so we need to scale them so they don’t dominate the results. To do this we apply feature scaling:
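(A sketch of the scaling step, applied to the training and test sets we created above.)

```python
from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
# Fit the scaler on the training set only, then reuse the
# same scaling on the test set
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
```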
Unsurprisingly, we use sklearn again, this time the StandardScaler class. It standardizes each column, transforming the raw values into scaled ones like -1.9993 or 0.3332. This brings the features closer to each other and lets us compute much faster.
After we have done those steps our data is ready to be processed by our model.
See you later!
Image taken from http://respondr.io