Handling Imbalanced Data
In imbalanced classification, one class vastly outnumbers the other: fraud, disease, and churn are all rare events. The trap is that a model that always predicts the majority class can score 99% accuracy while catching nothing. The fix is to measure with precision, recall, F1, PR-AUC, and ROC-AUC, and to rebalance with resampling, class weights, and threshold tuning.
Part of the free AI & Machine Learning course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
This lesson shows why accuracy lies, which metrics to trust, and the practical toolkit: oversampling (SMOTE), undersampling, class_weight='balanced', threshold tuning, stratified splits, the imbalanced-learn library, and an anomaly-detection framing for extreme imbalance.
When 99% of cases are negative, a model that simply predicts “negative” every time is 99% accurate — and completely useless, because it never finds the rare positives you actually care about. Same data, two stories:
- Accuracy alone: a single number hides total failure on the class that matters.
- Precision, recall, and F1: these immediately reveal that the model found nothing.
These are the numbers to report on imbalanced problems. Pick the ones that match the cost of a miss vs a false alarm:
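Tip: for very rare positives, PR-AUC is usually more informative than ROC-AUC. The false positive rate behind ROC-AUC is divided by the huge number of negatives, so ROC-AUC can stay deceptively high even when most of the model's positive flags are wrong.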
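The gap between the two scores can be seen on synthetic data. This is a minimal sketch, assuming NumPy and scikit-learn are installed; the class sizes and score distributions are made up for illustration, and average_precision_score is used here as one common summary of the PR curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Illustrative assumption: 20,000 negatives and 100 positives (about 0.5% positive).
n_neg, n_pos = 20_000, 100
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Hypothetical model scores: positives tend to score higher, with some overlap.
scores = np.concatenate([
    rng.normal(0.0, 1.0, n_neg),
    rng.normal(2.0, 1.0, n_pos),
])

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
print(f"positive rate (PR-AUC of a random model): {y_true.mean():.3f}")
```

On data like this the ROC-AUC looks strong while the PR-AUC is much lower, because even a small false positive rate on 20,000 negatives produces more false alarms than there are real positives.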
Once you measure honestly, you can fix the imbalance. The main levers, roughly from simplest to most involved:
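Three of these levers (a stratified split, class weights, and threshold tuning) can be sketched together with scikit-learn. This is a minimal sketch, assuming scikit-learn is installed; the synthetic dataset, its roughly 3% positive rate, and the choice of logistic regression are illustrative assumptions rather than part of the lesson's own example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, illustrative dataset with roughly 3% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)

# Stratified split: train and test keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Class weights: errors on the rare class cost more during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Threshold tuning: compare the default 0.5 cutoff with a lower one.
proba = model.predict_proba(X_test)[:, 1]
results = {}
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision, so the right cutoff depends on how expensive a miss is compared with a false alarm.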
Run this to see the accuracy trap with your own eyes — 99% accuracy, zero positives caught:
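A minimal sketch of that experiment in plain Python, assuming a made-up label vector with 1% positives and a "model" that always predicts the majority class:

```python
# Synthetic labels: 990 negatives (0) and 10 positives (1), i.e. 1% positive.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# True positives: rare cases the model actually caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy: {accuracy:.2%}")                      # accuracy: 99.00%
print(f"positives caught: {caught} of {sum(y_true)}")   # positives caught: 0 of 10
```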
Now compute the metrics that actually expose model quality on rare classes:
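A minimal sketch that computes precision, recall, and F1 from raw counts in plain Python. The helper name precision_recall_f1 and both sets of predictions are hypothetical, chosen only to make the arithmetic easy to follow:

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Define each metric as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 10 positives followed by 990 negatives.
y_true = [1] * 10 + [0] * 990

# Always-negative model: 99% accurate, but every metric below is zero.
y_always_negative = [0] * 1000

# Hypothetical model: catches 6 of 10 positives and raises 2 false alarms.
y_model = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 988

for name, y_pred in [("always negative", y_always_negative), ("model", y_model)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# always negative: precision=0.00 recall=0.00 f1=0.00
# model: precision=0.75 recall=0.60 f1=0.67
```

In practice, scikit-learn's precision_score, recall_score, and f1_score compute the same quantities; the hand-written version is only there to show where the numbers come from.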
When positives are vanishingly rare, standard classification struggles — treating the problem as anomaly detection (learn what “normal” looks like and flag the unusual) is often the stronger framing.
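One way to try that framing is scikit-learn's IsolationForest, fitted on normal data only. This is a minimal sketch under stated assumptions: the two-feature synthetic data, the placement of the unusual points, and the contamination value of 0.02 are all illustrative, and other detectors could be swapped in.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# "Normal" behaviour only: the detector never sees a labelled positive.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))

# contamination is an assumed tuning value: the share of training points
# the detector is allowed to treat as unusual when setting its threshold.
detector = IsolationForest(contamination=0.02, random_state=0)
detector.fit(X_normal)

# New data: more normal points plus a few far-away, unusual ones.
X_new_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_new_odd = rng.uniform(low=6.0, high=8.0, size=(10, 2))

# predict returns 1 for inliers and -1 for anomalies.
pred_normal = detector.predict(X_new_normal)
pred_odd = detector.predict(X_new_odd)

false_alarm_rate = float(np.mean(pred_normal == -1))
caught = int(np.sum(pred_odd == -1))

print(f"unusual points flagged: {caught} of {len(X_new_odd)}")
print(f"false alarm rate on normal points: {false_alarm_rate:.1%}")
```

The detector needs no positive labels at all, which is the appeal when only a handful of positives exist; the trade-off is that it flags anything unusual, not specifically the class you care about.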
📋 Imbalanced-data checklist
Practice quiz
Why is accuracy misleading when classes are heavily imbalanced?
- A model that always predicts the majority class scores very high accuracy while finding nothing
- Accuracy cannot be computed on skewed data
- Accuracy only works for regression
- Accuracy ignores the majority class
Answer: A model that always predicts the majority class scores very high accuracy while finding nothing. If 99% of cases are negative, always predicting 'negative' gives 99% accuracy but catches zero positives — useless for the task.
Which metric focuses on: of the cases we flagged positive, how many really were?
- Recall
- Accuracy
- Precision
- R-squared
Answer: Precision. Precision = true positives / all predicted positives — the fraction of your positive flags that were correct.
Which metric focuses on: of all the real positives, how many did we catch?
- Specificity
- Recall
- Precision
- Accuracy
Answer: Recall. Recall = true positives / all actual positives — how much of the rare class you successfully found.
The F1 score is the...
- Sum of precision and recall
- Product of accuracy and recall
- Difference of precision and recall
- Harmonic mean of precision and recall
Answer: Harmonic mean of precision and recall. F1 = 2 × precision × recall / (precision + recall), so it is high only when both precision and recall are high, which makes it a useful single number for imbalanced problems.
What does SMOTE do?
- Creates synthetic new examples of the minority class
- Removes features
- Deletes majority-class rows at random
- Increases the learning rate
Answer: Creates synthetic new examples of the minority class. SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic minority-class points by interpolating between existing ones.
What does class_weight='balanced' do in many scikit-learn models?
- Drops the minority class
- Makes errors on the rare class cost more during training
- Shuffles the data
- Sets accuracy as the loss
Answer: Makes errors on the rare class cost more during training. It sets each class's weight inversely proportional to its frequency, so the model pays a higher penalty for getting rare cases wrong.
Why use a STRATIFIED train/test split with imbalanced data?
- To make training faster
- To remove the minority class from the test set
- To add more features
- To keep the same class proportions in each split
Answer: To keep the same class proportions in each split. Stratified splitting preserves the class ratio in train and test, so a tiny minority class isn't accidentally absent from a split.
Threshold tuning means...
- Deleting outliers
- Lowering the learning rate
- Changing the probability cutoff used to decide the positive class
- Retraining with more layers
Answer: Changing the probability cutoff used to decide the positive class. Many models output probabilities; lowering the decision threshold (e.g. from 0.5 to 0.3) typically raises recall at the cost of precision, and raising it does the reverse.
Which Python library is purpose-built for resampling imbalanced datasets?
- beautifulsoup
- imbalanced-learn
- matplotlib
- requests
Answer: imbalanced-learn. imbalanced-learn (imported as imblearn) provides SMOTE, random under/over-samplers, and pipelines for imbalanced data.
For EXTREME imbalance (e.g. a few fraud cases in millions), a useful framing is...
- Treat it as anomaly / outlier detection
- Ignore the rare class
- Use accuracy only
- Treat it as regression
Answer: Treat it as anomaly / outlier detection. When positives are vanishingly rare, framing it as anomaly detection (model 'normal', flag deviations) often works better than standard classification.