Dimensionality Reduction with PCA
Squash many correlated features into a handful of new axes that keep most of the information. Learn how PCA finds directions of maximum variance and how to read the explained variance ratio.
Part of the free AI & Machine Learning course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
To photograph a 3D sculpture as a single 2D photo, you choose the angle that shows the most detail — you'd never shoot it edge-on, where everything collapses into a thin line. PCA does the same with data: it finds the viewing angle (direction) along which the points spread out the most, then flattens onto it while losing as little as possible.
That best angle is the first principal component. The next-best angle, at right angles to it, is the second, and so on.
PCA rests on one idea: the directions where your data spreads out the most carry the most information. A direction where all the points sit on top of each other tells you nothing; a direction where they fan out tells you a lot. That spread is the variance.
So PCA's job is to find new axes — principal components — ordered by how much variance each one captures. PC1 captures the most, PC2 the next-most (and is orthogonal to PC1), and so on. Each component is a linear combination (weighted blend) of your original features.
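To see what "linear combination" and "orthogonal" mean in practice, here is a minimal sketch in plain Python. The weights are hypothetical numbers chosen for illustration; real PCA would learn them from the data.

```python
# Hypothetical weights for two components over two features.
# These numbers are made up for illustration; real PCA learns them from data.
pc1_weights = [0.6, 0.8]
pc2_weights = [-0.8, 0.6]


def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# One centred sample: its value on each original feature.
sample = [3.0, 1.0]

# A component score is a weighted blend of the original features.
pc1_score = dot(pc1_weights, sample)
pc2_score = dot(pc2_weights, sample)

# Orthogonal directions have a dot product of zero.
orthogonality = dot(pc1_weights, pc2_weights)

print("PC1 score:", round(pc1_score, 3))
print("PC2 score:", round(pc2_score, 3))
print("PC1 . PC2:", round(orthogonality, 3))
```

Each score mixes both features, which is why a component rarely maps back to a single original column.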
Let's make it concrete. We project the data onto a few candidate directions and measure the spread of the resulting 1D shadow. Among the candidates, the direction with the most variance is the one closest to the true first component.
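The projection idea can be sketched in plain Python. The points, the candidate angles, and the helper name shadow_variance are all assumptions made for this illustration, not the lesson's original listing.

```python
import math
from statistics import mean, pvariance

# Made-up 2D points that lie roughly along the diagonal (illustrative only).
points = [(-2.0, -1.8), (-1.0, -1.2), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]


def shadow_variance(points, angle_deg):
    """Variance of the points after projecting them onto one direction."""
    # Unit vector pointing along the chosen angle.
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    # Centre the data so the projection measures spread, not position.
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Each point's 1D "shadow" is its dot product with the direction.
    shadow = [(x - mx) * dx + (y - my) * dy for x, y in points]
    return pvariance(shadow)


candidates = [0, 45, 90, 135]
variances = {angle: shadow_variance(points, angle) for angle in candidates}
best_angle = max(variances, key=variances.get)

for angle in candidates:
    print(f"{angle:>3} degrees -> variance {variances[angle]:.3f}")
print("Best candidate:", best_angle, "degrees")
```

With these points the 45 degree direction wins, because the data runs along the diagonal, while the 135 degree direction (at right angles to it) shows almost no spread.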
Each component captures a fraction of the data's total variance: its explained variance ratio. These fractions are sorted high to low and add up to 1.0 across all components. To decide how many components to keep, add them from the top until you reach a target such as 90% or 95%.
For example, if PC1 keeps 0.85 and PC2 keeps 0.10, then two components keep 0.85 + 0.10 = 0.95. That is 95% of the variance on two axes, half as many as a four-feature dataset starts with.
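The add-until-you-hit-the-target rule can be written as a short helper. This is a sketch in plain Python; the function names are hypothetical, and the last two ratios are made-up padding so that the four values add up to 1.0.

```python
def explained_variance_ratios(variances):
    """Turn per-component variances into fractions of the total."""
    total = sum(variances)
    return [v / total for v in variances]


def components_needed(ratios, target):
    """Count how many leading components reach the target fraction."""
    running = 0.0
    for count, ratio in enumerate(ratios, start=1):
        running += ratio
        # Small tolerance so floating-point rounding cannot miss the target.
        if running >= target - 1e-9:
            return count
    return len(ratios)


# 0.85 and 0.10 come from the example above; 0.04 and 0.01 are made up.
ratios = [0.85, 0.10, 0.04, 0.01]
print(components_needed(ratios, 0.90))  # 2
print(components_needed(ratios, 0.95))  # 2
```

The ratios must already be sorted from largest to smallest, which is the order scikit-learn reports them in.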
Here's PCA with scikit-learn's PCA class, after scaling. Study it (it isn't runnable in the in-browser sandbox). Notice that fit_transform turns 4 features into 2 components, and explained_variance_ratio_ tells you how much of the variance each one kept.
The scaler comes before PCA in the pipeline — skipping it lets the largest-range feature hijack the first component.
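A minimal sketch of what such a pipeline might look like. The dataset is an assumption: scikit-learn's built-in iris table is used only because it happens to have 4 numeric features, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the iris dataset stands in for "4 features"; any numeric
# table with four columns would work the same way.
X = load_iris().data            # shape (150, 4)

# Scale first, then reduce to two components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipeline.fit_transform(X)

# make_pipeline names each step after its class, in lowercase.
ratios = pipeline.named_steps["pca"].explained_variance_ratio_

print("Before:", X.shape)       # (150, 4)
print("After: ", X_2d.shape)    # (150, 2)
print("Explained variance ratio per component:", ratios)
print("Total kept:", ratios.sum())
```

Putting both steps in one pipeline means the scaler can never be skipped by accident, and the same object can later transform new data in the same way.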
Reach for PCA when you have many features, especially correlated ones. It can speed up training, reduce noise, ease the "curse of dimensionality", and let you plot high-dimensional data in 2D or 3D.
Fill in the blanks so the helper projects the data onto a direction and returns the variance of the shadow. The expected output is in the comments.
Add explained-variance ratios until you reach a target. Fill in the blank so the function returns the number of components needed for 90% variance.
Given each component's explained variance, print the running cumulative total and how many components reach 95%. Only a comment outline is provided.
These are the most common PCA mistakes. Watch for them.
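❌ Skipping scaling before PCA
The largest-range feature dominates the first component for no real reason.
✅ Fix: standardise the features first, so every feature starts with comparable variance.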
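The effect is easy to reproduce. The sketch below uses made-up data (an income column in dollars and an age column in years, both invented for this illustration) and compares the weights of the first component with and without scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: income in dollars (huge range) and age in years (small
# range). Age is built to rise with income, so both carry real signal.
rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, size=200)
age = 40 + (income - 50_000) / 3_000 + rng.normal(0, 5, size=200)
X = np.column_stack([income, age])

# Without scaling: PC1 points almost entirely along income.
raw_weights = PCA(n_components=1).fit(X).components_[0]

# With scaling: both features get a comparable say in PC1.
X_scaled = StandardScaler().fit_transform(X)
scaled_weights = PCA(n_components=1).fit(X_scaled).components_[0]

print("PC1 weights, unscaled [income, age]:", np.round(raw_weights, 4))
print("PC1 weights, scaled   [income, age]:", np.round(scaled_weights, 4))
```

On the raw data the income weight is close to 1 in absolute value and the age weight is close to 0, purely because dollars produce bigger numbers than years. After scaling, the two weights have equal size.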
❌ Fitting PCA before the train/test split
Fitting PCA on all the data before splitting leaks test-set information into training, which is a form of data leakage.
✅ Fix: fit on train only, transform the test set:
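One way this fix might look with scikit-learn. The iris dataset, the 80/20 split, and the variable names are assumptions made for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: iris is used only as a convenient 4-feature dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
pca = PCA(n_components=2)

# The fit_transform calls see the training rows only.
X_train_2d = pca.fit_transform(scaler.fit_transform(X_train))

# The test rows are only transformed, using what was learned from train.
X_test_2d = pca.transform(scaler.transform(X_test))

print(X_train_2d.shape, X_test_2d.shape)
```

Calling fit or fit_transform on the test set, or on the full dataset before the split, would let test-set statistics shape the components.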
❌ Using PCA when you need interpretable features
Components blend many features, so you can no longer explain a decision in plain terms.
✅ Fix: keep the original features (or use feature selection) when explanation matters:
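As one possible illustration of the feature-selection route, the sketch below uses scikit-learn's SelectKBest with an ANOVA F-test. The choice of selector, the value k=2, and the iris dataset are all assumptions; other selection methods would serve the same purpose.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X, y = data.data, data.target

# Keep the 2 original columns that score highest on an ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# get_support() marks which original columns survived.
kept = [
    name
    for name, keep in zip(data.feature_names, selector.get_support())
    if keep
]
print("Kept features:", kept)
print("New shape:", X_selected.shape)
```

Unlike principal components, the surviving columns are still the original measurements, so a decision can be explained in terms of them.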
You now understand how PCA finds the directions of greatest variance, what principal components are, how to read the explained variance ratio, and why you must scale first.
🚀 Up next: Gradient Boosting — chaining weak learners into one of the most accurate models for tabular data.
Practice quiz
What is the main goal of Principal Component Analysis (PCA)?
- Add more features
- Reduce the number of features while keeping most of the variance
- Label the data
- Increase training error
Answer: Reduce the number of features while keeping most of the variance. PCA projects data onto fewer new axes (components) chosen to retain as much of the original variance as possible.
In PCA, what does 'variance' along a direction represent?
- The label of the data
- How much the data spreads out along that direction
- The number of rows
- The learning rate
Answer: How much the data spreads out along that direction. Variance measures spread; PCA seeks directions of maximum spread because that is where the information lives.
What is the first principal component?
- The direction of greatest variance in the data
- A random feature
- The smallest eigenvalue
- Always the original first feature
Answer: The direction of greatest variance in the data. PC1 is the single direction along which the data varies the most; later components capture the next-most variance, each orthogonal to the ones before it.
Are principal components related to the original features?
- They are identical to the original features
- They are new axes, each a linear combination of the original features
- They are the labels
- They are always two of the original features
Answer: They are new axes, each a linear combination of the original features. Each component is a weighted blend (linear combination) of the original features, not one original feature on its own.
What does the 'explained variance ratio' tell you?
- The training time
- The fraction of total variance each component captures
- The number of clusters
- The kernel type
Answer: The fraction of total variance each component captures. It reports what share of the data's total variance each component retains, so you can decide how many to keep.
Why must you scale (standardise) features before PCA?
- PCA ignores scale
- PCA is variance-based, so a large-range feature would dominate the components
- Scaling adds variance
- Only trees need scaling
Answer: PCA is variance-based, so a large-range feature would dominate the components. PCA chases variance, and an unscaled large-range feature has huge raw variance, so it would hijack the first component.
Principal components are orthogonal to each other. This means they are:
- Parallel
- Uncorrelated / at right angles
- Identical
- Always negative
Answer: Uncorrelated / at right angles. Components are mutually orthogonal, so they capture non-overlapping (uncorrelated) directions of variation.
When is PCA most useful?
- When you have very few features already
- When you need exact original features for a regulator
- When data has no variance
- When you have many correlated features and want compression or visualization
Answer: When you have many correlated features and want compression or visualization. PCA shines with many correlated features — it compresses them, speeds models, and enables 2D/3D visualization.
A downside of PCA is that the resulting components are:
- Always perfectly interpretable
- Hard to interpret because they mix many original features
- Guaranteed to improve accuracy
- Identical to the labels
Answer: Hard to interpret because they mix many original features. Because each component blends many features, the new axes lose the clear meaning the original columns had.
If PC1 explains 0.85 of the variance and PC2 explains 0.10, how much is kept by using both?
- 0.10
- 0.85
- 0.95
- 1.00
Answer: 0.95. Explained-variance ratios add up: 0.85 + 0.10 = 0.95, so the first two components retain 95% of the variance.