Preprocessing your Data and Why?
Preprocessing plays a very important role in data science. When you feed data to an algorithm, there are various things to take care of, such as normalization, label encoding, scaling, and so on.
A simple manual method to remove outliers from your data is as follows:
**Your data should lie between median - 2*STD and median + 2*STD**
STD = Standard Deviation
In the formula above we use the median as the measure of central tendency, assuming it is more robust than the mean, but in real life this formula may not always be accurate.
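The rule above can be sketched in a few lines of NumPy. The data values here are made up purely for illustration:

```python
import numpy as np

# Hypothetical data with one obvious outlier (150.0).
data = np.array([12.0, 15.0, 14.0, 13.0, 150.0, 16.0, 11.0])

median = np.median(data)  # robust measure of central tendency
std = data.std()          # spread of the data

# Keep only values within median +/- 2 standard deviations.
lower, upper = median - 2 * std, median + 2 * std
filtered = data[(data >= lower) & (data <= upper)]

print(filtered)  # the extreme value 150.0 is dropped
```

Note that a single large outlier inflates the standard deviation itself, which is one reason this rule is only a rough first pass.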
In this post we will discuss various data preprocessing techniques used in data science and machine learning that may help improve your prediction rate. Sometimes you can live without preprocessing, but sometimes it is good to preprocess your data. This 'sometimes' depends on your understanding of the work you do.
The real health of your data describes the real wealth of your model. In other words, your data is the most important part, which is why preparing it for better use usually takes around 70-80% of your time. Well, not really 70-80% if you are a Python ninja 😀 🙂
Let's understand data preprocessing in Richard Stallman's style. (What is Richard Stallman's style?)
Hack the source!!
Data Scaling or Standardization:
Standardization rescales each value x to z = (x - μ) / σ, where sigma (σ) represents the standard deviation and mu (μ) represents the mean.
It is always great to have less biased data with low variance (spread), but why?
Think of the activation function in a neural network during forward propagation. What the activation function does is squash each input into the range of zero to one (0 to 1), which scales the range of the data. Other algorithms, such as regression or classification, don't have that automatic scaling facility, so we apply manual scaling methods. (Too bad I have to write one more function before training my data :D) One more thing to remember: your neural network will train much faster if you feed it normalized data.
By decreasing the spread (variance) we can achieve better predictions, because it is easier for the system/algorithm to find patterns in a smaller area. Here is a small portion of the Wikipedia article on feature scaling that you might find interesting:
For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
```python
from sklearn import preprocessing
import numpy as np

frag_mented_array = np.array([2, 78, 12, -25, 22, 89])
defrag_array = preprocessing.scale(frag_mented_array)
```

```
/home/metal-machine/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by the scale function.
array([-0.67832556,  1.18502658, -0.43314765, -1.34030593, -0.18796973,
        1.45472228])
```
**Please never underestimate that warning in real life.** Look into the advanced sections on NumPy arrays, which will teach you how to assign dtypes to NumPy arrays!
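For example, declaring the dtype up front avoids the implicit int64-to-float64 conversion that triggered the warning above:

```python
import numpy as np

# Declaring float64 explicitly means scale() gets the dtype it wants,
# so no DataConversionWarning is raised.
arr = np.array([2, 78, 12, -25, 22, 89], dtype=np.float64)
print(arr.dtype)  # float64
```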
Now the question arises: should I scale my training data or my testing data as well? The answer is both. Look at the StandardScaler class in scikit-learn:
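A minimal sketch with made-up numbers: the scaler is fit on the training data only, and the same learned statistics are then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split (values chosen only for illustration).
X_train = np.array([[2.0], [78.0], [12.0], [-25.0]])
X_test = np.array([[22.0], [89.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and variance
X_test_scaled = scaler.transform(X_test)        # reuses training statistics

print(X_train_scaled.mean(), X_train_scaled.std())  # ~0.0 and ~1.0
```

Fitting on the training set alone keeps information about the test set from leaking into the model.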
There are some other useful features as well:
MinMaxScaler – Scale data between range (0,1)
MaxAbsScaler – Scale data between range (-1,1)
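Both scalers can be compared side by side on the same small array used earlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X = np.array([[2.0], [78.0], [12.0], [-25.0], [22.0], [89.0]])

# MinMaxScaler maps the minimum to 0 and the maximum to 1.
print(MinMaxScaler().fit_transform(X).ravel())

# MaxAbsScaler divides by the largest absolute value, so the result
# lies in [-1, 1] and zeros stay exactly zero (handy for sparse data).
print(MaxAbsScaler().fit_transform(X).ravel())
```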
One question that arises here: when should I use standard scaling, MinMaxScaler, or MaxAbsScaler?
The answer can be really tricky, or 'not in plain form'. It just depends on the requirements of your model or the ML algorithm you are going to apply to your data.
We can also understand it by looking at how decreasing the Euclidean distance between points affects the performance of your model.
Data Normalization:
Normalization is the process of rescaling individual samples (rows) to a common unit norm.
Remember one thing carefully: scaled data will always have zero mean and unit variance; achieving that is the real purpose behind scaling or standardization. Scaling is a kind of personal choice of how and to what limits you want to scale your data, but when it comes to normalization you have to figure out an external standard.
Normalizing can either mean applying a transformation so that your transformed data is roughly normally distributed, or, as in scikit-learn, rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).
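The scikit-learn meaning can be seen directly with the Normalizer class. The 3-4-5 triangle values are chosen so the result is easy to verify by hand:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Each row is rescaled independently so its L2 norm (length) equals 1.
X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
X_norm = Normalizer(norm='l2').fit_transform(X)

print(X_norm)                           # [[0.6 0.8] [1. 0.]]
print(np.linalg.norm(X_norm, axis=1))   # every row now has length 1
```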
Data Binarization:
It is a really simple and effective process. It converts your data into binary form, that is, 0 and 1 (boolean values). Algorithms used in stock trading rely heavily on binarization: we use 0 if a stock's predicted price will go down and 1 if the price will go up, or vice versa. The important thing to note is that binarization also matters for making final decisions, when the algorithm must predict values in the form of 0 and 1.
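A small sketch of this idea using scikit-learn's Binarizer. The price changes here are invented for illustration only:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted price changes: values above the threshold (0.0)
# become 1 (price up); everything else becomes 0 (price down or flat).
price_changes = np.array([[1.2, -0.4, 0.0, 3.1, -2.2]])
signals = Binarizer(threshold=0.0).fit_transform(price_changes)

print(signals)  # [[1. 0. 0. 1. 0.]]
```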
Removing Bias from Your Data:
Biased data leads your model towards one view of the real universe; your model will only learn that bias in your data and make predictions based on it. (That is the reason we must use hypothesis testing and cross-validation before running our model in production.) One example of bias that comes to mind is as follows:
Suppose you picked Apple's stock price for the period 01-01-14 to 01-01-15, and let's assume that in that period Apple's stock went up every day/month/quarter. After training your model on that particular time period, it will predict Apple's future price as higher than the present price, because that became the nature of the model after learning from biased data.
This is just a silly example to show readers how bias affects your data.
Survivorship-Biased Data:
A historical database of stock prices that does not include stocks that have disappeared due to bankruptcies, de-listings, mergers, or acquisitions suffers from the so-called survivorship bias, because only "survivors" of those often unpleasant events remain in the database. (The same term can be applied to mutual fund or hedge fund databases that do not include funds that went out of business.)
Backtesting a strategy using data with survivorship bias can be dangerous because it may inflate the historical performance of the strategy.
Disclaimer: This is not a foolproof post about data preprocessing. There are many other things to know, such as PCA (Principal Component Analysis) or gradient descent, for a deeper understanding of machine learning operations applied to your data.