Handling Imbalanced Data

In imbalanced classification, one class vastly outnumbers the other: fraud, disease, and churn are all rare events. The trap: a model that always predicts the majority class can score 99% accuracy while catching nothing. The fix is to measure with precision, recall, F1, PR-AUC, and ROC-AUC, and to compensate for the imbalance with resampling, class weights, and threshold tuning.

Part of the free AI & Machine Learning course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

This lesson shows why accuracy lies, which metrics to trust, and the practical toolkit: oversampling (SMOTE), undersampling, class_weight='balanced', threshold tuning, stratified splits, the imbalanced-learn library, and an anomaly-detection framing for extreme imbalance.

When 99% of cases are negative, a model that simply predicts “negative” every time is 99% accurate — and completely useless, because it never finds the rare positives you actually care about. Same data, two stories:

  • Accuracy alone: a single number hides total failure on the class that matters.
  • Precision/recall/F1: these immediately reveal that the model found nothing.

These are the numbers to report on imbalanced problems. Pick the ones that match the cost of a miss vs a false alarm:
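As a quick illustration of how precision, recall, and F1 come out of the raw counts, here is a minimal sketch in plain Python. The function name and the example counts are made up for illustration; they are not taken from the lesson's dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator is zero (a common convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical fraud model: 8 frauds caught, 12 false alarms, 2 frauds missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=12, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# The always-negative model flags nothing, so every metric collapses to 0.
print(precision_recall_f1(tp=0, fp=0, fn=10))
```

If a missed positive is expensive (a missed disease, say), weight your choice toward recall; if a false alarm is expensive (blocking a legitimate payment), weight it toward precision.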

Tip: for very rare positives, PR-AUC is usually more informative than ROC-AUC. ROC-AUC is built on the false-positive rate, which stays small when negatives dominate even if false alarms outnumber true hits, so it can look deceptively high.
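The gap between the two scores can be sketched with a deliberately contrived example, assuming scikit-learn and NumPy are installed. The scores below are invented so the result is easy to check by hand, and average precision is used as the PR-AUC summary.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Contrived scores: 1000 negatives spread evenly over [0, 0.9]
# and 10 positives that all receive the score 0.85.
neg_scores = np.linspace(0.0, 0.9, 1000)
pos_scores = np.full(10, 0.85)

y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])
y_score = np.concatenate([neg_scores, pos_scores])

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # a common PR-AUC summary

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
```

Here the positives outrank about 94% of the negatives, so ROC-AUC looks strong, yet 56 negatives still score higher than every positive. Flagging all 10 positives therefore means 66 flags in total, and average precision lands near 0.15.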

Once you measure honestly, you can fix the imbalance. The main levers, roughly from simplest to most involved:
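Three of these levers (a stratified split, class weights, and threshold tuning) can be sketched together with scikit-learn alone. This is a minimal illustration on synthetic data; the 5% positive rate, the logistic regression model, and the 0.2 cutoff are assumptions chosen for the demo, and resampling (for example SMOTE from imbalanced-learn) is left out to keep the dependencies small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 5% positives (an assumption for the demo).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Lever: a stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Lever: class weights make mistakes on the rare class cost more in training.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
weighted.fit(X_train, y_train)

# Lever: threshold tuning moves the probability cutoff away from 0.5.
proba = plain.predict_proba(X_test)[:, 1]
recall_default = recall_score(y_test, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_test, (proba >= 0.2).astype(int))

print("positive rate train/test:", y_train.mean(), y_test.mean())
print("recall, plain model:     ", recall_default)
print("recall, balanced weights:", recall_score(y_test, weighted.predict(X_test)))
print("recall, cutoff 0.2:      ", recall_lowered)
```

Lowering the cutoff can only keep or raise recall, at the price of more false alarms, so check precision alongside it before settling on a threshold.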

Run this to see the accuracy trap with your own eyes — 99% accuracy, zero positives caught:
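A minimal sketch of the trap, assuming scikit-learn is installed and using a made-up dataset of 990 negatives and 10 positives. DummyClassifier stands in for a model that always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up dataset: 990 negatives and 10 positives (1% positive rate).
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A "model" that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

accuracy = accuracy_score(y, y_pred)
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

print(f"accuracy: {accuracy:.2%}")                 # accuracy: 99.00%
print(f"positives caught: {tp} of {tp + fn}")      # positives caught: 0 of 10
```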

Now compute the metrics that actually expose model quality on rare classes:
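One way to do this, sketched with scikit-learn's metric functions on hand-built predictions. Both "models" here are hypothetical: the first is the always-negative trap, the second is an invented model that raises 14 false alarms and catches 7 of the 10 positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 990 negatives followed by 10 positives.
y_true = np.array([0] * 990 + [1] * 10)

# Model A: the accuracy trap, predicts negative for everything.
pred_trap = np.zeros_like(y_true)

# Model B (hypothetical): 14 false alarms, catches 7 of the 10 positives.
pred_useful = np.zeros_like(y_true)
pred_useful[:14] = 1      # 14 negatives wrongly flagged
pred_useful[990:997] = 1  # 7 positives correctly flagged


def report(name, y_pred):
    """Print accuracy next to the metrics that expose rare-class quality."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 avoids a warning when nothing is predicted positive.
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
    return scores


trap_scores = report("always-negative:", pred_trap)
useful_scores = report("hypothetical model:", pred_useful)
```

The always-negative model wins on accuracy (0.99 against 0.983) but scores 0 on precision, recall, and F1, while the hypothetical model reaches a recall of 0.7.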

When positives are vanishingly rare, standard classification struggles — treating the problem as anomaly detection (learn what “normal” looks like and flag the unusual) is often the stronger framing.
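As a rough sketch of that framing, the example below fits scikit-learn's IsolationForest on unlabeled synthetic data. The data, the 1% contamination value, and the choice of IsolationForest itself are assumptions made for illustration; other detectors may suit a real problem better.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in data: 2000 "normal" points near the origin and
# 20 anomalies far away from it (about 1% of the rows).
normal = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))
anomalies = rng.uniform(6.0, 9.0, size=(20, 2)) * rng.choice([-1, 1], size=(20, 2))
X = np.vstack([normal, anomalies])
is_anomaly = np.array([False] * 2000 + [True] * 20)

# No labels are used for fitting; contamination is our assumed anomaly rate.
forest = IsolationForest(contamination=0.01, random_state=0).fit(X)
flagged = forest.predict(X) == -1  # predict() returns -1 for outliers, 1 for inliers

caught = int((flagged & is_anomaly).sum())
print(f"flagged {int(flagged.sum())} rows, "
      f"caught {caught} of {int(is_anomaly.sum())} anomalies")
```

The labels in is_anomaly are used only to score the result afterwards. In practice the contamination rate is a guess, so it is worth checking how the flagged set changes as you vary it.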

📋 Imbalanced-data checklist

Practice quiz

Why is accuracy misleading when classes are heavily imbalanced?

  • A model that always predicts the majority class scores very high accuracy while finding nothing
  • Accuracy cannot be computed on skewed data
  • Accuracy only works for regression
  • Accuracy ignores the majority class

Answer: A model that always predicts the majority class scores very high accuracy while finding nothing. If 99% of cases are negative, always predicting 'negative' gives 99% accuracy but catches zero positives — useless for the task.

Which metric focuses on: of the cases we flagged positive, how many really were?

  • Recall
  • Accuracy
  • Precision
  • R-squared

Answer: Precision. Precision = true positives / all predicted positives — the fraction of your positive flags that were correct.

Which metric focuses on: of all the real positives, how many did we catch?

  • Specificity
  • Recall
  • Precision
  • Accuracy

Answer: Recall. Recall = true positives / all actual positives — how much of the rare class you successfully found.

The F1 score is the...

  • Sum of precision and recall
  • Product of accuracy and recall
  • Difference of precision and recall
  • Harmonic mean of precision and recall

Answer: Harmonic mean of precision and recall. F1 is the harmonic mean of precision and recall, balancing the two in a single number for imbalanced problems.

What does SMOTE do?

  • Creates synthetic new examples of the minority class
  • Removes features
  • Deletes majority-class rows at random
  • Increases the learning rate

Answer: Creates synthetic new examples of the minority class. SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic minority-class points by interpolating between existing ones.

What does class_weight='balanced' do in many scikit-learn models?

  • Drops the minority class
  • Makes errors on the rare class cost more during training
  • Shuffles the data
  • Sets accuracy as the loss

Answer: Makes errors on the rare class cost more during training. It automatically up-weights the minority class so the model pays a higher penalty for getting rare cases wrong.

Why use a STRATIFIED train/test split with imbalanced data?

  • To make training faster
  • To remove the minority class from the test set
  • To add more features
  • To keep the same class proportions in each split

Answer: To keep the same class proportions in each split. Stratified splitting preserves the class ratio in train and test, so a tiny minority class isn't accidentally absent from a split.

Threshold tuning means...

  • Deleting outliers
  • Lowering the learning rate
  • Changing the probability cutoff used to decide the positive class
  • Retraining with more layers

Answer: Changing the probability cutoff used to decide the positive class. Models output probabilities; moving the decision threshold (e.g. from 0.5 to 0.3) trades precision against recall to fit your needs.

Which Python library is purpose-built for resampling imbalanced datasets?

  • beautifulsoup
  • imbalanced-learn
  • matplotlib
  • requests

Answer: imbalanced-learn. imbalanced-learn (imported as imblearn) provides SMOTE, random under/over-samplers, and pipelines for imbalanced data.

For EXTREME imbalance (e.g. a few fraud cases in millions), a useful framing is...

  • Treat it as anomaly / outlier detection
  • Ignore the rare class
  • Use accuracy only
  • Treat it as regression

Answer: Treat it as anomaly / outlier detection. When positives are vanishingly rare, framing it as anomaly detection (model 'normal', flag deviations) often works better than standard classification.