Performance: vectorization & memory

Performance tuning in pandas is the practice of making your code faster and leaner by replacing slow Python row-loops with vectorized operations that run in compiled C, and by shrinking memory with smaller dtypes and the category type — so the same analysis handles far larger datasets.

Learn Performance: vectorization & memory in our free Pandas course — a beginner-friendly interactive lesson with worked examples, a practice exercise and a…

Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

The single biggest pandas speed lesson: operate on whole columns at once , not row by row. A vectorized expression like df["a"] * df["b"] runs in compiled C over the entire array; a Python loop or iterrows() runs interpreted Python for every row and is typically 10–100× slower.

.apply is still a per-row Python call. For conditional logic, np.where (or .clip , .map , boolean masks) does the same job vectorized.

pandas defaults to wide dtypes (int64, float64, object strings). Downcasting numbers to int32 / int16 and converting low-cardinality text columns to category can slash memory. Measure with df.memory_usage(deep=True) .

Replace the slow loop with a single vectorized expression that applies a 10% discount. Fill in the blank.

Vectorize a calculation and shrink the frame's memory footprint, then confirm the savings.

Lesson complete — your pandas is now fast and lean!

You know to vectorize instead of loop, to skip .apply when a vectorized op exists, and to cut memory with smaller dtypes and category. These habits let you scale from toy data to millions of rows.

🚀 Up next: the Capstone — an end-to-end data cleaning pipeline tying everything together.

Practice quiz

Why is a vectorized expression faster than a Python row-loop?

  • It uses more memory
  • It runs as compiled C over whole arrays
  • It prints less
  • It skips rows

Answer: It runs as compiled C over whole arrays. Vectorized ops run in compiled C on entire arrays, avoiding per-row Python overhead.

Roughly how much slower can iterrows be than a vectorized op?

  • About the same
  • Slightly faster
  • 10-100x slower
  • Always 2x slower

Answer: 10-100x slower. Looping with iterrows runs interpreted Python per row, typically 10-100x slower.

What is a good vectorized replacement for an if/else .apply?

  • np.where(cond, x, y)
  • a for loop
  • df.iterrows()
  • print()

Answer: np.where(cond, x, y). np.where applies conditional logic vectorized across the whole column.

When is .apply acceptable?

  • Always, it's fastest
  • For every column
  • Never use it
  • Only when no vectorized equivalent exists

Answer: Only when no vectorized equivalent exists. .apply is essentially a loop; reserve it for cases with no vectorized alternative.

Which dtype best stores a low-cardinality repeated string column?

  • category
  • object
  • float64
  • int64

Answer: category. category stores each unique value once plus tiny integer codes, saving lots of memory.

How do you measure a DataFrame's true memory use including object strings?

  • len(df)
  • df.memory_usage(deep=True)
  • df.size
  • df.count()

Answer: df.memory_usage(deep=True). deep=True accounts for the actual size of object (string) columns.

When does the category dtype HURT rather than help?

  • When values repeat a lot
  • When the column is numeric only
  • When almost every value is unique (high cardinality)
  • Never

Answer: When almost every value is unique (high cardinality). With mostly-unique values, category adds overhead instead of saving memory.

Which downcasts a column to a smaller integer type automatically?

  • pd.to_numeric(s, downcast='integer')
  • s.astype('object')
  • s.apply(int)
  • s.round()

Answer: pd.to_numeric(s, downcast='integer'). pd.to_numeric with downcast='integer' picks the smallest integer dtype that fits.

Is %timeit plain Python or a Jupyter magic?

  • A pandas method
  • A built-in function
  • A Jupyter/IPython magic command
  • A NumPy function

Answer: A Jupyter/IPython magic command. %timeit is an IPython magic that times a line over many runs; it isn't plain Python.

Which of these string cleanups is vectorized (no apply needed)?

  • A for loop over names
  • df.iterrows()
  • json parsing
  • names.str.strip().str.title()

Answer: names.str.strip().str.title(). The .str accessor runs vectorized string operations across the Series.