Performance: vectorization & memory
Performance tuning in pandas means making your code faster and leaner. You replace slow Python row-loops with vectorized operations that run in compiled C, and you shrink memory with smaller dtypes and the category type, so the same analysis handles far larger datasets.
Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.
The single biggest pandas speed lesson: operate on whole columns at once, not row by row. A vectorized expression like df["a"] * df["b"] runs in compiled C over the entire array; a Python loop or iterrows() runs interpreted Python for every row and is typically 10–100× slower.
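As a minimal sketch of the difference, the snippet below computes the same product both ways. The sample data and the variable names loop_result and vec_result are made up for illustration; only the column names "a" and "b" come from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the columns "a" and "b" mirror the example above.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(10_000), "b": rng.random(10_000)})

# Slow: iterrows() hands back one row at a time, so Python code runs per row.
loop_result = [row["a"] * row["b"] for _, row in df.iterrows()]

# Fast: a single expression over whole columns.
vec_result = df["a"] * df["b"]

# Both approaches produce the same numbers.
assert np.allclose(loop_result, vec_result)
```

Timing the two lines yourself (for example with %timeit in a notebook) is the best way to see the gap on your own machine; the exact ratio depends on the data size and the work done per row.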
.apply with a Python function is still a Python call per row (or per element), so it behaves like a loop. For conditional logic, np.where (or .clip, .map, boolean masks) does the same job vectorized.
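The sketch below shows one way an if/else .apply might be rewritten with the vectorized tools named above. The "price" column, the 40 threshold, and the 100 cap are hypothetical values chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "price" column and thresholds are illustrative only.
df = pd.DataFrame({"price": [5.0, 20.0, 50.0, 120.0]})

# Per-element Python call: works, but the lambda runs once for every value.
tier_apply = df["price"].apply(lambda p: "high" if p > 40 else "low")

# Vectorized if/else: np.where(condition, value_if_true, value_if_false).
df["tier"] = np.where(df["price"] > 40, "high", "low")

# Vectorized cap: .clip limits values without any explicit condition.
df["capped"] = df["price"].clip(upper=100)

# Boolean mask: select the matching rows in one step.
expensive = df[df["price"] > 40]
```

Here tier_apply and df["tier"] hold the same labels; the np.where version simply avoids calling a Python function for each value.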
pandas defaults to wide dtypes (int64, float64, object strings). Downcasting numbers to int32 / int16 (when the values fit) and converting low-cardinality text columns to category can cut memory substantially. Measure with df.memory_usage(deep=True).
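A rough sketch of that workflow follows: measure, shrink, measure again. The frame, its "qty" and "city" columns, and the variable names are assumptions made up for this example, and the exact byte counts will vary with your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a small-range integer column and a repetitive text column.
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "qty": np.arange(n, dtype="int64") % 100,            # values 0..99 stored as int64
    "city": rng.choice(["Paris", "Lima", "Oslo"], size=n),  # only 3 distinct strings
})

# deep=True counts the real size of the string data, not just the pointers.
before = df.memory_usage(deep=True).sum()

small = df.copy()
# Pick the smallest integer dtype that can hold every value in the column.
small["qty"] = pd.to_numeric(small["qty"], downcast="integer")
# Store each distinct string once, plus a small integer code per row.
small["city"] = small["city"].astype("category")

after = small.memory_usage(deep=True).sum()
print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The category conversion pays off here because "city" has only three distinct values; on a column where almost every value is unique it would add overhead instead.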
Replace the slow loop with a single vectorized expression that applies a 10% discount. Fill in the blank.
Vectorize a calculation and shrink the frame's memory footprint, then confirm the savings.
Lesson complete — your pandas is now fast and lean!
You know to vectorize instead of loop, to skip .apply when a vectorized op exists, and to cut memory with smaller dtypes and category. These habits let you scale from toy data to millions of rows.
🚀 Up next: the Capstone — an end-to-end data cleaning pipeline tying everything together.
Practice quiz
Why is a vectorized expression faster than a Python row-loop?
- It uses more memory
- It runs as compiled C over whole arrays
- It prints less
- It skips rows
Answer: It runs as compiled C over whole arrays. Vectorized ops run in compiled C on entire arrays, avoiding per-row Python overhead.
Roughly how much slower can iterrows be than a vectorized op?
- About the same
- Slightly faster
- 10-100x slower
- Always 2x slower
Answer: 10-100x slower. Looping with iterrows runs interpreted Python per row, typically 10-100x slower.
What is a good vectorized replacement for an if/else .apply?
- np.where(cond, x, y)
- a for loop
- df.iterrows()
- print()
Answer: np.where(cond, x, y). np.where applies conditional logic vectorized across the whole column.
When is .apply acceptable?
- Always, it's fastest
- For every column
- Never use it
- Only when no vectorized equivalent exists
Answer: Only when no vectorized equivalent exists. .apply is essentially a loop; reserve it for cases with no vectorized alternative.
Which dtype best stores a low-cardinality repeated string column?
- category
- object
- float64
- int64
Answer: category. category stores each unique value once plus a small integer code per row, saving lots of memory.
How do you measure a DataFrame's true memory use including object strings?
- len(df)
- df.memory_usage(deep=True)
- df.size
- df.count()
Answer: df.memory_usage(deep=True). deep=True accounts for the actual size of object (string) columns.
When does the category dtype HURT rather than help?
- When values repeat a lot
- When the column is numeric only
- When almost every value is unique (high cardinality)
- Never
Answer: When almost every value is unique (high cardinality). With mostly-unique values, category adds overhead instead of saving memory.
Which downcasts a column to a smaller integer type automatically?
- pd.to_numeric(s, downcast='integer')
- s.astype('object')
- s.apply(int)
- s.round()
Answer: pd.to_numeric(s, downcast='integer'). pd.to_numeric with downcast='integer' picks the smallest integer dtype that fits.
What kind of command is %timeit?
- A pandas method
- A built-in function
- A Jupyter/IPython magic command
- A NumPy function
Answer: A Jupyter/IPython magic command. %timeit is an IPython magic that times a line over many runs; it isn't plain Python.
Which of these string cleanups is vectorized (no apply needed)?
- A for loop over names
- df.iterrows()
- json parsing
- names.str.strip().str.title()
Answer: names.str.strip().str.title(). The .str accessor runs vectorized string operations across the Series.
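The string cleanup from the last question can be sketched as follows. The sample names are invented for illustration; the .str.strip() and .str.title() calls are the ones named in the answer.

```python
import pandas as pd

# Hypothetical messy input: stray spaces and inconsistent capitalization.
names = pd.Series(["  alice smith ", "BOB JONES", " carol  "])

# Chain .str methods: trim surrounding whitespace, then title-case each name.
cleaned = names.str.strip().str.title()

print(cleaned.tolist())  # ['Alice Smith', 'Bob Jones', 'Carol']
```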