Pre-processing is one of the fundamental steps in data analysis, and it can raise or lower model accuracy to a large extent. In this blog post we will look at some of the most used pre-processing techniques in Sklearn and where to use each of them.
Let's get our hands dirty:
Pre-processing simply transforms data into a form that machine learning algorithms can digest. Most machine learning algorithms take numeric data as input, so converting raw data into that form is what pre-processing does.
Standard Scaler :
- The StandardScaler assumes your data is normally distributed within each feature and scales each feature so that its distribution is centred around 0 with a standard deviation of 1.
- Calculate – Subtract the mean of the column and divide by the standard deviation of the column.
- Not suitable where the data is not normally distributed.
So where would you use Standard Scaler rather than Min Max?
The answer, as always, is ‘it depends’, but here are some general guidelines:
For most cases StandardScaler does no harm. It is especially important for methods that are sensitive to the variance of the features (PCA, clustering, logistic regression, SVMs, perceptrons, neural networks).
Sklearn implementation:
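As a minimal sketch of the Standard Scaler described above: the small array `X` is made-up illustration data, and the manual calculation is only there to show that `StandardScaler` matches the "subtract the mean, divide by the standard deviation" rule.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Learn the per-column mean and standard deviation, then scale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same calculation by hand: (x - column mean) / column std.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

In a real project you would normally call `fit` on the training data only and reuse the fitted scaler with `transform` on the test data.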
Min Max Scaler :
- One of the most popular scalers.
- Calculate – Subtract the minimum of the column and divide by the difference between the maximum and the minimum.
- Data is rescaled to lie between 0 and 1.
- If the distribution is not suitable for StandardScaler, this scaler can be used instead.
- Sensitive to outliers.
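The points above can be sketched as follows. The column of values is made up, with one deliberately large value added to show the sensitivity to outliers; treat this as an illustration rather than a recipe.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature; the last value is a deliberate outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)

# The same calculation by hand: (x - min) / (max - min).
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled.ravel())
print(np.allclose(X_scaled, X_manual))
```

Because the outlier defines the maximum, the four ordinary values are squeezed into a tiny part of the 0 to 1 range, which is what "sensitive to outliers" means in practice.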
Normalization :
Each value in a sample is obtained by dividing by the magnitude (norm) of that sample.
Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.
It is useful to scale the input attributes for a model that relies on the magnitude of values, such as the distance measures used in k-nearest neighbours and the preparation of coefficients in regression.
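A small sketch of normalization, assuming that "dividing by magnitude" refers to scikit-learn's `Normalizer`, which rescales each row (sample) to unit length. The data is made up, and the choice of the L2 norm is an assumption.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up data: each row is one sample.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

# Divide every row by its own L2 norm (its magnitude).
normalizer = Normalizer(norm="l2")
X_norm = normalizer.fit_transform(X)

# After normalization every row has length 1.
row_lengths = np.linalg.norm(X_norm, axis=1)

print(X_norm)
print(row_lengths)
```

Note that `Normalizer` works row by row, unlike the scalers above, which work column by column. With non-negative inputs such as these, the resulting values fall between 0 and 1.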
Binarization :
Binarization is thresholding numerical values to binary values (0 or 1).
Preferred where the data is assumed to follow a Bernoulli distribution, for example with Bernoulli Naïve Bayes.
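Thresholding can be sketched with scikit-learn's `Binarizer`. The data and the threshold of 0.5 are made-up values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up numeric data.
X = np.array([[0.2, 1.5, 0.0],
              [3.0, 0.5, 0.9]])

# Values strictly greater than the threshold become 1, the rest become 0.
binarizer = Binarizer(threshold=0.5)
X_bin = binarizer.fit_transform(X)

print(X_bin)
```

A value exactly equal to the threshold (0.5 here) is mapped to 0.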
Handling Of Categorical Variables :
Label Encoding – Label encoding is the process of converting each category of a categorical variable into a single numeric label.
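A minimal sketch of label encoding with scikit-learn's `LabelEncoder`. The list of colours is made-up illustration data.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values.
colors = ["red", "green", "blue", "green", "red"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# Classes are stored in sorted order; each code is an index into classes_.
print(encoder.classes_)
print(codes)

# The mapping can be reversed to recover the original categories.
original = encoder.inverse_transform(codes)
print(list(original))
```

The integer codes imply an ordering (blue < green < red) that the categories do not really have, so some models may read meaning into it. scikit-learn documents `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` or `OneHotEncoder` may be a better fit.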
Handling Missing Data :
Missing data causes havoc in machine learning algorithms.
There are techniques to handle missing values with imputers.
- Imputers can be used to fill in the mean value; if required, the missing values can even be predicted with regression.
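Mean imputation can be sketched with scikit-learn's `SimpleImputer`. The array with missing entries is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with missing entries marked as NaN.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Other strategies such as `"median"` and `"most_frequent"` are available through the same `strategy` parameter. This sketch does not cover the regression-based approach.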
These are by far the most used pre-processing techniques for machine learning.