Support Vector Machines
Learn how an SVM finds the widest possible gap between classes, why only a few "support vector" points matter, and how the kernel trick bends straight lines into curves.
Part of the free AI & Machine Learning course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
Imagine two neighbourhoods you must separate with a road. You could draw the road hugging one side, but the safest choice is the widest street that keeps the most distance from the nearest house on either side. That central road is the SVM's decision boundary, and the empty space on each side is the margin.
The houses closest to the road, the ones that determine how wide it can be, are the support vectors. Move a faraway house and nothing changes; move a support vector and the whole road shifts.
A hyperplane is just the generalisation of a line (2D) or a plane (3D) to any number of dimensions: it's the flat surface that separates the classes. Many hyperplanes can split the same data, so the SVM adds a rule: pick the one with the maximum margin.
The margin is the distance from the hyperplane to the nearest training point of each class. The few points that sit exactly on the edge of that margin are the support vectors. Remarkably, they are the only points that define the boundary: delete the rest and the SVM is unchanged.
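To see how the margin of a hyperplane is measured, here is a minimal sketch in plain Python. It assumes a hyperplane written as w·x + b = 0 and uses the standard point-to-hyperplane distance |w·x + b| / ||w||. The toy points, the candidate boundary and the helper names (signed_distance, margin_and_support) are made up for illustration, and the boundary is only a candidate, not necessarily the one an SVM would pick.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_and_support(w, b, points):
    """Return (margin, closest points) for a candidate hyperplane."""
    distances = [abs(signed_distance(w, b, p)) for p in points]
    margin = min(distances)
    support = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
    return margin, support

# Hypothetical toy data: class A on the left, class B on the right.
points = [(1.0, 1.0), (2.0, 3.0), (0.0, 2.0),   # class A
          (4.0, 1.0), (5.0, 3.0), (6.0, 2.0)]   # class B

# Candidate boundary: the vertical line x = 3, i.e. w = (1, 0), b = -3.
margin, support = margin_and_support((1.0, 0.0), -3.0, points)
print(margin)   # 1.0
print(support)  # [(2.0, 3.0), (4.0, 1.0)]
```

Only the two closest points limit the margin of this candidate; moving any of the other four further away leaves the result unchanged.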
Let's make the margin concrete. With one feature, a boundary is just a threshold. The margin is the distance from that threshold to the closest point of either class. Run this and watch which threshold gives the widest gap.
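A minimal sketch of that experiment, assuming two small made-up classes and a hand-picked list of candidate thresholds (all of them lie between the two classes, so each one separates the data):

```python
# Hypothetical 1D data: one feature value per point.
class_a = [1.0, 2.0, 3.0]
class_b = [7.0, 8.0, 9.0]

def margin(threshold, a, b):
    """Distance from the threshold to the nearest point of either class."""
    return min(abs(x - threshold) for x in a + b)

best_threshold, best_margin = None, -1.0
for threshold in [3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5]:
    m = margin(threshold, class_a, class_b)
    print(f"threshold={threshold}  margin={m}")
    if m > best_margin:
        best_threshold, best_margin = threshold, m

print("widest:", best_threshold, best_margin)  # widest: 5.0 2.0
```

The widest gap sits halfway between the closest points of the two classes (3.0 and 7.0), which are the support vectors of this tiny problem.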
Real data is rarely separable by a straight line. The kernel trick lets a linear SVM draw curved boundaries. A kernel function measures how similar two points are as if they had been mapped into a much higher-dimensional space — without ever building that space.
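One way to convince yourself that a kernel really does work "as if" the data had been mapped is to check a small case by hand. The sketch below uses a degree-2 polynomial kernel, chosen here only because its feature map is short enough to write out; the function names poly_kernel and explicit_map are illustrative.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: the squared dot product of x and z."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def explicit_map(x):
    """The 3D feature map that the kernel above corresponds to."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(x, z)
via_mapping = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))

print(via_kernel)                             # 121.0
print(math.isclose(via_kernel, via_mapping))  # True
```

The kernel gets the same number from the original 2D points without ever building the 3D vectors. For the RBF kernel the corresponding space is infinite-dimensional, so the shortcut is the only practical route.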
In practice you call a library. Here's an RBF SVM with scikit-learn's SVC, wrapped in a scaler because SVMs are scale-sensitive. Study it (it isn't runnable in the in-browser sandbox) and notice how the concepts map: kernel="rbf" is the kernel trick, C trades margin width against errors, and gamma sets each point's reach.
After fitting, n_support_ tells you how many support vectors each class contributed, often just a small fraction of the data.
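A minimal sketch of such a pipeline, assuming scikit-learn is installed. The two-moons dataset, the split sizes and the settings C=1.0 and gamma="scale" are illustrative choices, not values taken from the lesson.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, curved two-class data (an assumption for this sketch).
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale first, then fit an RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
svc = model.named_steps["svc"]
print("test accuracy:", model.score(X_test, y_test))
print("support vectors per class:", svc.n_support_)
print("training rows:", len(X_train))
```

Comparing the support-vector counts with the number of training rows shows how much of the data actually shapes the boundary.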
Two knobs control how an RBF SVM fits. C is the inverse of the regularisation strength: small C regularises more, allowing a wider margin and tolerating some misclassified points; large C insists on getting the training points right, narrowing the margin and risking overfitting. Gamma sets how far the influence of a single point reaches: small gamma gives smooth, broad boundaries; large gamma makes each point matter only locally, producing wiggly boundaries that can overfit.
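To make "reach" concrete, here is a small sketch of the RBF kernel itself, exp(-gamma * squared distance), evaluated for a nearby and a faraway point. The points and gamma values are made up for illustration.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = (0.0, 0.0)
near, far = (0.5, 0.0), (2.0, 0.0)

for gamma in (0.1, 1.0, 10.0):
    print(gamma,
          round(rbf_kernel(x, near, gamma), 3),
          round(rbf_kernel(x, far, gamma), 3))
# 0.1 0.975 0.67
# 1.0 0.779 0.018
# 10.0 0.082 0.0
```

With small gamma even the faraway point still counts as fairly similar, so each point influences a broad region. With large gamma the similarity collapses to almost zero a short distance away, which is what makes the boundary wiggly.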
Fill in the blanks so margin() returns the distance to the nearest point of either class. The expected output is in the comments so you can self-check.
Decide whether to use a small or large C based on how noisy the data is. Fill in the two blanks so the recommendation matches the rule of thumb.
Now do it with the support faded. Given two 1D classes, scan candidate thresholds and print which one gives the widest margin. Only a comment outline is provided.
These traps catch almost everyone working with SVMs. Watch for them.
An unscaled large-range feature dominates the distance calculation and wrecks the boundary.
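A quick way to see the problem is to compare distances before and after standardising. The rows below (age in years, income in dollars) and the helper standardise are hypothetical, written in plain Python so the arithmetic stays visible; in a real project you would more likely put a scaler such as StandardScaler in front of the SVM.

```python
import math
import statistics

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardise(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

# Hypothetical rows: (age in years, income in dollars).
rows = [(25, 30000), (60, 31000), (26, 90000)]

# Raw distances: the 35-year age gap barely registers next to income.
print(euclidean(rows[0], rows[1]))  # about 1000.6
print(euclidean(rows[0], rows[2]))  # about 60000.0

scaled = standardise(rows)
print(euclidean(scaled[0], scaled[1]))  # about 2.15
print(euclidean(scaled[0], scaled[2]))  # about 2.14
```

Before scaling, income decides everything. After scaling, a large age difference and a large income difference count about equally.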
Huge C and huge gamma together memorise the training set — perfect train score, poor test score.
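The gap between train and test score is the symptom to look for. The sketch below compares an extreme setting with a moderate one, assuming scikit-learn is installed. The noisy two-moons data and the specific values (1000 versus 1) are illustrative, and train_and_test_scores is a helper invented for this example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Noisy synthetic data (an assumption for this sketch).
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def train_and_test_scores(C, gamma):
    """Fit a scaled RBF SVM and return (train accuracy, test accuracy)."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

overfit_train, overfit_test = train_and_test_scores(C=1000.0, gamma=1000.0)
moderate_train, moderate_test = train_and_test_scores(C=1.0, gamma=1.0)

print("huge C and gamma:     train", overfit_train, "test", overfit_test)
print("moderate C and gamma: train", moderate_train, "test", moderate_test)
```

In practice you would pick C and gamma with cross-validation rather than by hand, so the choice is judged on data the model has not seen.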
Kernel SVM training time grows at least quadratically with the number of rows; on millions of rows it can be impractically slow.
✅ Fix: use a linear model or tree ensemble for very large data:
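As one possible version of that fix, here is a sketch using scikit-learn's LinearSVC, a linear SVM that avoids the kernel computation. It assumes scikit-learn is installed; the synthetic dataset and the settings C=1.0 and max_iter=5000 are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for a larger dataset (an assumption for this sketch).
X, y = make_classification(
    n_samples=20000, n_features=20, n_informative=10,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Still scale first: linear SVMs are scale-sensitive too.
linear_model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
linear_model.fit(X_train, y_train)
print("test accuracy:", linear_model.score(X_test, y_test))
```

The trade-off is that a linear model can only draw a flat boundary, so if the data needs curves a tree ensemble may be the better fallback.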
You now understand how an SVM finds the maximum-margin hyperplane, why only the support vectors matter, how the kernel trick bends the boundary, and how C and gamma control the fit.
🚀 Up next: K-Means & Clustering — your first unsupervised algorithm, where there are no labels at all.
Practice quiz
What does a Support Vector Machine try to find when separating two classes?
- The line through the most points
- The hyperplane with the largest margin
- The shortest decision boundary
- The boundary closest to the origin
Answer: The hyperplane with the largest margin. An SVM picks the separating hyperplane that maximises the margin — the gap to the nearest points of each class.
What are 'support vectors' in an SVM?
- All training points
- The points closest to the decision boundary
- Only the misclassified points
- The class centroids
Answer: The points closest to the decision boundary. Support vectors are the points sitting on or inside the margin; they alone define the boundary, so the rest can be removed.
What is the 'margin' in an SVM?
- The training error rate
- The number of features
- The distance between the boundary and the nearest points
- The regularisation penalty
Answer: The distance between the boundary and the nearest points. The margin is the width of the gap between the hyperplane and the closest data points on each side; SVMs maximise it.
What problem does the kernel trick solve?
- It speeds up gradient descent
- It lets a linear SVM separate data that is not linearly separable
- It removes the need for support vectors
- It scales features automatically
Answer: It lets a linear SVM separate data that is not linearly separable. Kernels compute similarities as if data were lifted into a higher dimension, so a linear boundary there is curved in the original space.
Which kernel is the common default for non-linear SVM problems?
- RBF (Gaussian) kernel
- Identity kernel
- Sigmoid-only kernel
- No kernel
Answer: RBF (Gaussian) kernel. The RBF (radial basis function) kernel is the usual go-to default; it can model smooth, curved boundaries with two tunable knobs.
What does the C hyperparameter control in an SVM?
- The kernel type
- The trade-off between a wide margin and misclassifying training points
- The number of support vectors directly
- The learning rate
Answer: The trade-off between a wide margin and misclassifying training points. Small C allows a wider margin with some errors (more regularisation); large C punishes errors harder, risking overfitting.
In an RBF kernel, what does a large gamma value do?
- Makes the influence of each point very local, risking overfitting
- Makes the boundary perfectly linear
- Disables the C parameter
- Forces a hard margin
Answer: Makes the influence of each point very local, risking overfitting. Large gamma means each point only influences a tiny neighbourhood, producing a wiggly boundary that can overfit.
Why should you scale features before training an SVM?
- SVMs ignore feature units anyway
- Distance and dot products are scale-sensitive, so large-range features dominate
- Scaling removes support vectors
- Only tree models need scaling
Answer: Distance and dot products are scale-sensitive, so large-range features dominate. SVMs rely on distances and dot products, so an unscaled large-range feature swamps the others; standardise first.
What is a 'soft margin' SVM?
- One that forbids any misclassification
- One with no support vectors
- One that only uses a polynomial kernel
- One that allows some points inside the margin or on the wrong side
Answer: One that allows some points inside the margin or on the wrong side. A soft margin tolerates a few violations (controlled by C), which is essential for noisy, overlapping real-world data.
When are SVMs often a strong choice?
- Very large datasets with millions of rows
- Small-to-medium datasets with clear margins and many features
- Only for image generation
- Only when no scaling is possible
Answer: Small-to-medium datasets with clear margins and many features. SVMs shine on small-to-medium, high-dimensional data with a clear gap; they scale poorly to very large row counts.