Handling Imbalanced Data

In imbalanced classification, one class vastly outnumbers the other: fraud, disease, and churn are all rare events. The trap is that a model that always predicts the majority class can score 99% accuracy while catching nothing. The fix is to measure with precision, recall, F1, PR-AUC, and ROC-AUC, and to counter the imbalance with resampling, class weights, and threshold tuning.

Part of the free AI & Machine Learning course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

This lesson shows why accuracy lies, which metrics to trust, and the practical toolkit: oversampling (SMOTE), undersampling, class_weight='balanced', threshold tuning, stratified splits, the imbalanced-learn library, and an anomaly-detection framing for extreme imbalance.

When 99% of cases are negative, a model that simply predicts “negative” every time is 99% accurate — and completely useless, because it never finds the rare positives you actually care about. Same data, two stories:

Accuracy only: a single number hides total failure on the class that matters.

Precision, recall, and F1: these immediately reveal that the model found nothing.

These are the numbers to report on imbalanced problems. Pick the ones that match the cost of a miss vs a false alarm:

Tip: for very rare positives, PR-AUC is usually more informative than ROC-AUC, because ROC-AUC can stay deceptively high when negatives dominate.
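To make the tip concrete, here is a minimal sketch on synthetic scores. The 1% positive rate, the two normal distributions, and the seed are assumptions chosen for illustration. It uses scikit-learn's average_precision_score, which is one common way to summarize the precision-recall curve in a single number.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Assumed setup: 19,800 negatives and 200 positives (1% positives).
n_neg, n_pos = 19_800, 200
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Hypothetical model scores: positives tend to score higher, with overlap.
scores = np.concatenate([
    rng.normal(0.0, 1.0, n_neg),
    rng.normal(1.5, 1.0, n_pos),
])

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)  # PR-curve summary

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
```

With this setup the ROC-AUC should look comfortable while the PR-AUC stays low, because even a small false-positive rate on 19,800 negatives swamps the 200 real positives.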

Once you measure honestly, you can fix the imbalance. The main levers, roughly from simplest to most involved:
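Three of the levers named in this lesson (stratified splits, class_weight='balanced', and threshold tuning) can be sketched together with scikit-learn. The synthetic dataset, the roughly 5% positive rate, the choice of logistic regression, and the 0.3 cutoff are all assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (assumed for illustration).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Lever 1: a stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline with no correction for imbalance.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Lever 2: class weights make minority-class errors cost more during training.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
weighted.fit(X_train, y_train)

# Lever 3: threshold tuning. Reuse the baseline's probabilities but
# flag a case as positive at 0.3 instead of the default 0.5.
proba = plain.predict_proba(X_test)[:, 1]
pred_at_03 = (proba >= 0.3).astype(int)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
recall_at_03 = recall_score(y_test, pred_at_03)

for name, pred in [
    ("plain, cutoff 0.5", plain.predict(X_test)),
    ("class_weight='balanced'", weighted.predict(X_test)),
    ("plain, cutoff 0.3", pred_at_03),
]:
    p = precision_score(y_test, pred, zero_division=0)
    r = recall_score(y_test, pred)
    print(f"{name:26s} precision={p:.2f} recall={r:.2f}")
```

Lowering the cutoff can only keep or raise recall, usually at the expense of precision, so the right setting depends on the cost of a miss versus a false alarm.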

Run this to see the accuracy trap with your own eyes — 99% accuracy, zero positives caught:
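The lesson's original snippet is not reproduced here; the following is a minimal stand-in in plain Python. The 990/10 class split is an assumption chosen to match the 99% figure above.

```python
# 1,000 cases: 990 negatives (0) and 10 positives (1), i.e. 99% / 1%.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: the share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# True positives: real positives that the model actually flagged.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"Accuracy: {accuracy:.1%}")                       # Accuracy: 99.0%
print(f"Positives caught: {caught} of {sum(y_true)}")    # Positives caught: 0 of 10
```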

Now compute the metrics that actually expose model quality on rare classes:
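As a stand-in for the missing snippet, here is a small sketch that computes precision, recall, and F1 from raw counts. The helper name precision_recall_f1 and the second "detector" are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Same 99% / 1% data as the accuracy trap: 990 negatives, then 10 positives.
y_true = [0] * 990 + [1] * 10

# Model A: always predicts the majority class.
always_negative = [0] * 1000

# Model B (hypothetical): 20 false alarms, 6 of the 10 positives caught.
some_detector = [1] * 20 + [0] * 970 + [1] * 6 + [0] * 4

for name, pred in [("always negative", always_negative),
                   ("some detector", some_detector)]:
    p, r, f = precision_recall_f1(y_true, pred)
    print(f"{name:16s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The always-negative model scores 0.00 on all three metrics. The hypothetical detector scores about 0.23 precision, 0.60 recall, and 0.33 F1, even though its accuracy (97.6%) is lower than the useless model's 99%.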

When positives are vanishingly rare, standard classification struggles — treating the problem as anomaly detection (learn what “normal” looks like and flag the unusual) is often the stronger framing.
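The idea of modelling "normal" and flagging deviations can be sketched with a deliberately simple detector: a z-score distance learned from normal data only. This is an illustrative stand-in, not a production method; dedicated detectors such as scikit-learn's IsolationForest are a common alternative. The data, the 99.9% cutoff, and the far-away "fraud" cluster are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" behaviour: the only data the detector learns from (no labels).
normal_train = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))

# Learn what normal looks like: per-feature mean and spread.
mean = normal_train.mean(axis=0)
std = normal_train.std(axis=0)


def anomaly_score(X):
    """Distance from the learned 'normal' centre, in standard deviations."""
    return np.linalg.norm((np.asarray(X) - mean) / std, axis=1)


# Flag anything more unusual than 99.9% of the normal training data.
threshold = np.quantile(anomaly_score(normal_train), 0.999)

# New data: ordinary cases plus a few hypothetical far-away fraud cases.
new_normal = rng.normal(0.0, 1.0, size=(1000, 2))
new_fraud = rng.normal(8.0, 0.5, size=(5, 2))

flagged_normal = int((anomaly_score(new_normal) > threshold).sum())
flagged_fraud = int((anomaly_score(new_fraud) > threshold).sum())

print(f"Normal cases flagged: {flagged_normal} of {len(new_normal)}")
print(f"Fraud cases flagged:  {flagged_fraud} of {len(new_fraud)}")
```

No positive examples are needed for training, which is why this framing suits cases where only a handful of positives exist. Real anomalies are rarely this cleanly separated, so the cutoff still has to be tuned against precision and recall.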

📋 Imbalanced-data checklist

Practice quiz

Why is accuracy misleading when classes are heavily imbalanced?

A model that always predicts the majority class scores very high accuracy while finding nothing
Accuracy cannot be computed on skewed data
Accuracy only works for regression
Accuracy ignores the majority class

Answer: A model that always predicts the majority class scores very high accuracy while finding nothing. If 99% of cases are negative, always predicting 'negative' gives 99% accuracy but catches zero positives — useless for the task.

Which metric focuses on: of the cases we flagged positive, how many really were?

Recall
Accuracy
Precision
R-squared

Answer: Precision. Precision = true positives / all predicted positives — the fraction of your positive flags that were correct.

Which metric focuses on: of all the real positives, how many did we catch?

Specificity
Recall
Precision
Accuracy

Answer: Recall. Recall = true positives / all actual positives — how much of the rare class you successfully found.

The F1 score is the...

Sum of precision and recall
Product of accuracy and recall
Difference of precision and recall
Harmonic mean of precision and recall

Answer: Harmonic mean of precision and recall. F1 is the harmonic mean of precision and recall, balancing the two in a single number for imbalanced problems.

What does SMOTE do?

Creates synthetic new examples of the minority class
Removes features
Deletes majority-class rows at random
Increases the learning rate

Answer: Creates synthetic new examples of the minority class. SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic minority-class points by interpolating between existing ones.
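The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration of the principle, not the imbalanced-learn implementation; the function name smote_like, its parameters, and the sample points are hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style oversampling (illustrative only).

    Each synthetic point lies on the line segment between a random
    minority point and one of its k nearest minority neighbours.
    Requires k < len(X_min).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority point
        j = neighbours[i, rng.integers(k)]  # pick one of its neighbours
        gap = rng.random()                  # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)


# Five hypothetical minority-class points in two dimensions.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_like(X_min, n_new=10)
print(new_points.shape)  # (10, 2)
```

Because every new point is a blend of two existing minority points, the synthetic data stays inside the region the minority class already occupies. Resampling of this kind is normally applied to the training split only, so the test set keeps the real class ratio.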

What does class_weight='balanced' do in many scikit-learn models?

Drops the minority class
Makes errors on the rare class cost more during training
Shuffles the data
Sets accuracy as the loss

Answer: Makes errors on the rare class cost more during training. It weights each class inversely to its frequency in the training data, so the model pays a higher penalty for getting rare cases wrong.

Why use a STRATIFIED train/test split with imbalanced data?

To make training faster
To remove the minority class from the test set
To add more features
To keep the same class proportions in each split

Answer: To keep the same class proportions in each split. Stratified splitting preserves the class ratio in train and test, so a tiny minority class isn't accidentally absent from a split.

Threshold tuning means...

Deleting outliers
Lowering the learning rate
Changing the probability cutoff used to decide the positive class
Retraining with more layers

Answer: Changing the probability cutoff used to decide the positive class. Most classifiers output a probability or score; moving the decision threshold (e.g. from 0.5 to 0.3) trades precision against recall to fit your needs.

Which Python library is purpose-built for resampling imbalanced datasets?

beautifulsoup
imbalanced-learn
matplotlib
requests

Answer: imbalanced-learn. imbalanced-learn (imported as imblearn) provides SMOTE, random under/over-samplers, and pipelines for imbalanced data.

For EXTREME imbalance (e.g. a few fraud cases in millions), a useful framing is...

Treat it as anomaly / outlier detection
Ignore the rare class
Use accuracy only
Treat it as regression

Answer: Treat it as anomaly / outlier detection. When positives are vanishingly rare, framing it as anomaly detection (model 'normal', flag deviations) often works better than standard classification.