Efficient Formats: Parquet, Feather & Pickle

Parquet, Feather, and Pickle are binary file formats that usually read and write a DataFrame faster than CSV and preserve its dtypes, so a saved file loads back with the same column types you saved. Parquet and Feather files are also typically much smaller than the equivalent CSV.


Part of the free Pandas course at LearnCodingFast — hands-on lessons with examples you run in your browser, plus practice exercises and a quick quiz.

You'll learn why columnar formats beat CSV, how to use to_parquet / read_parquet, and how to use the always-available to_pickle / read_pickle pair.

CSV is row-oriented plain text with no type information: every number is stored as characters and re-parsed on every read, and a date saved as text comes back as a string unless you parse it again. Columnar binary formats like Parquet store each column's values together, record the column's dtype, and compress the data, so they read back the same data with the same types, usually much faster and from a far smaller file.
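To make the type loss concrete, here is a minimal sketch that writes a small frame to an in-memory CSV and reads it back. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Illustrative data: one datetime column, one float column.
df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "amount": [1.5, 2.5],
})

# Write to an in-memory text buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)

# Plain read: the dates come back as text, not datetimes.
buf.seek(0)
loaded = pd.read_csv(buf)

# Asking pandas to re-parse the column restores a datetime dtype.
buf.seek(0)
reparsed = pd.read_csv(buf, parse_dates=["when"])

print(df["when"].dtype)
print(loaded["when"].dtype)
print(reparsed["when"].dtype)
```

The point is that CSV makes you state the types again on every read, while the binary formats carry them in the file.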

The API mirrors every other pandas reader/writer. Feather requires pyarrow, and Parquet needs a Parquet engine: pandas uses pyarrow by default and can fall back to fastparquet (pip install pyarrow covers both formats). Parquet generally compresses harder and is the usual choice for archival storage; Feather (Arrow IPC) is tuned for very fast reads and writes of intermediate results.
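A short sketch of both round-trips follows. It assumes pyarrow is installed and simply skips the file work if it is not; the file names and sample data are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

try:
    import pyarrow  # engine used by to_parquet / to_feather in this sketch
except ImportError:
    pyarrow = None

# Illustrative data with three different dtypes.
df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "city": ["Oslo", "Lima"],
    "amount": [1.5, 2.5],
})

results = {}
if pyarrow is not None:
    with tempfile.TemporaryDirectory() as tmp:
        pq_path = Path(tmp) / "demo.parquet"
        ft_path = Path(tmp) / "demo.feather"

        # Same shape as to_csv / read_csv: one writer, one reader per format.
        df.to_parquet(pq_path)
        df.to_feather(ft_path)

        results["parquet"] = pd.read_parquet(pq_path)
        results["feather"] = pd.read_feather(ft_path)

        # Columnar layout lets you load only the columns you ask for.
        results["subset"] = pd.read_parquet(pq_path, columns=["city"])

for name, frame in results.items():
    print(name, list(frame.columns))
```

Note that to_feather expects a default index (0, 1, 2, ...); call reset_index() first if your frame has a custom one.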

to_pickle / read_pickle serialise any DataFrame to a binary blob using Python's built-in pickle. No extra packages, perfect dtype fidelity, and it handles odd column types other formats reject. The trade-off: it is Python-only and unsafe to load from untrusted sources.
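As a rough illustration of the "odd column types" point, the sketch below pickles a frame whose cells hold Python sets, which is the kind of column that columnar writers commonly reject. The data and file name are invented for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative frame: an object column of Python sets plus a timedelta column.
df = pd.DataFrame({
    "user": ["ann", "bob"],
    "tags": [{"a", "b"}, {"c"}],
    "elapsed": pd.to_timedelta([60, 90], unit="min"),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.pkl"
    df.to_pickle(path)

    # Unpickling can run arbitrary code, so only load files you created
    # or otherwise trust.
    loaded = pd.read_pickle(path)

print(loaded.dtypes)
print(loaded["tags"].tolist())
```

The sets come back as real set objects, not as strings, because pickle stores the Python objects themselves.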

Prove a round-trip is lossless across mixed dtypes.
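One possible way to approach this exercise is sketched below, using pickle and an in-memory BytesIO buffer so no file is left on disk. The column names and values are assumptions chosen to cover several dtypes.

```python
import io

import pandas as pd

# Illustrative frame mixing integer, float (with NaN), categorical,
# datetime, and boolean columns.
df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int32"),
    "price": [9.99, 14.5, float("nan")],
    "grade": pd.Categorical(["a", "b", "a"]),
    "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "ok": [True, False, True],
})

buf = io.BytesIO()
df.to_pickle(buf)

# Rewind: after writing, the buffer position sits at the end of the data.
buf.seek(0)
loaded = pd.read_pickle(buf)

# equals() compares values and dtypes, treating NaNs in the same
# position as equal.
same_values = df.equals(loaded)
same_dtypes = list(df.dtypes) == list(loaded.dtypes)

print(same_values, same_dtypes)
```

Checking dtypes separately is redundant with equals() here, but it makes the "mixed dtypes" claim explicit when you read the output.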


You know why columnar Parquet and Feather beat CSV on speed, size, and dtype fidelity, and you can always fall back to the built-in to_pickle / read_pickle for a lossless round-trip.


Practice quiz

How is Parquet stored compared to CSV?

Row-oriented plain text
Columnar binary with dtypes and compression
As JSON
As a Python script

Answer: Columnar binary with dtypes and compression. Parquet is a columnar binary format that records the schema and dtypes and compresses each column.

What happens to a datetime column saved to CSV and read back?

It stays datetime
It comes back as plain text (string)
It becomes an integer
It is dropped

Answer: It comes back as plain text (string). CSV is typeless text, so dates round-trip as strings unless you re-parse them.

Which method writes a DataFrame to Parquet?

df.write_parquet()
pd.parquet()
df.to_parquet()
df.save()

Answer: df.to_parquet(). df.to_parquet('file.parquet') writes; pd.read_parquet reads it back.

What library do Parquet and Feather require?

pyarrow
numpy
requests
openpyxl

Answer: pyarrow. Feather requires pyarrow (Apache Arrow), and pandas uses it as the default Parquet engine; fastparquet is an alternative for Parquet only.

Which format is built into Python and needs no extra packages?

Parquet
Feather
Pickle
ORC

Answer: Pickle. to_pickle/read_pickle ship with pandas and always work without dependencies.

Why is Pickle unsafe to load from untrusted sources?

It is too large
It loses dtypes
It can execute arbitrary code when loaded
It only works on CSV

Answer: It can execute arbitrary code when loaded. Unpickling can run arbitrary code, so only load pickles you trust.

What advantage does columnar storage give when reading?

You can read just the columns you need
It auto-sorts rows
It removes NaN
It adds an index

Answer: You can read just the columns you need. Columnar Parquet lets you read a subset of columns without scanning the whole file, for example by passing columns=[...] to pd.read_parquet.

What must you do before reading a BytesIO buffer you just wrote?

Close the file
Call buf.seek(0) to rewind
Convert to CSV
Delete it

Answer: Call buf.seek(0) to rewind. After a write, the buffer position sits at the end of the data, so without seeking back to the start the reader finds nothing to read and typically raises an error.

Which format is tuned for the fastest read/write of intermediate results?

CSV
Feather (Arrow IPC)
JSON
Excel

Answer: Feather (Arrow IPC). Feather is optimised for very fast read/write; Parquet compresses harder for archival.

After to_pickle then read_pickle, what does df.equals(loaded) return?

False
An error
NaN
True

Answer: True. Pickle is a lossless round-trip, so the frames are identical.