A balanced dataset is rare in machine learning; most real-world data comes in various shapes and sizes. An imbalanced dataset can wreak havoc on machine learning models and produce a misleadingly high accuracy score. In this post we will look at various techniques for handling imbalanced datasets in Python.
Imbalanced Classes & Impact
- Data with a skewed class distribution.
- Common examples are spam/ham emails and malicious/normal network packets.
- Fraud detection, intrusion detection, and cancer cell prediction are a few applications.
- Classification algorithms are prone to predicting the heavier (majority) class.
- Accuracy score is not the right metric.
- We have to rely on metrics like the confusion matrix, recall, and precision.
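As a quick illustration (a minimal sketch using scikit-learn on a synthetic 9:1 dataset), the accuracy score can look high while the confusion matrix and per-class recall reveal how the minority class is actually treated:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with a 9:1 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Accuracy alone hides how the minority class is handled;
# the confusion matrix and per-class recall/precision do not
print("accuracy:", clf.score(X_te, y_te))
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))
```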
Oversampling and undersampling of data
The most straightforward methods require little change to the training pipeline: they simply adjust the example sets until the classes are balanced. Oversampling randomly replicates minority instances to grow their population, while undersampling randomly downsamples the majority class. Some data scientists think oversampling is superior because it results in more data, whereas undersampling throws data away. But keep in mind that replicating data is not without consequence: because it creates duplicate records, it makes variables appear to have lower variance than they actually do. It also duplicates the number of errors: if a classifier makes a false negative on the original minority dataset, and that dataset is replicated five times, the classifier will make six errors on the new set. Conversely, undersampling can make the independent variables look like they have higher variance than they do.
Some of these techniques, with Python implementations, are shown below:
SMOTE (Synthetic Minority Oversampling Technique)
- Generates new samples by interpolating between existing minority samples
- It doesn't duplicate data
ADASYN (Adaptive Synthetic Sampling Method)
- Similar to SMOTE, this also generates data.
- Focuses generation on minority samples that are wrongly classified, i.e. the harder-to-learn regions
Random undersampling
- Reduces the data of the over-represented (majority) class
- The retained data is picked randomly from the sample and not derived
ClusterCentroids for data generation
- Generates representative data using k-means clustering
- The centroids of the clusters are used in place of the majority-class samples
Making learning algorithms aware of class distribution
- Most classification algorithms provide a way to pass class distribution information
- Internally, the learning algorithm uses this to configure itself in favour of the under-represented class
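In scikit-learn this is typically the `class_weight` parameter; a minimal sketch with logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=42)

# class_weight='balanced' reweights each class inversely to its
# frequency: n_samples / (n_classes * class_count), so errors on
# the minority class cost more during training
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)
print("minority recall:", recall_score(y, clf.predict(X)))
```

No resampling is needed here; the imbalance is handled inside the loss function.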
These are a few techniques for handling the data imbalance problem in machine learning.