If you work with data, you've probably seen .parquet files. They're everywhere — in Spark jobs, data lakes, machine learning pipelines, and analytics platforms. But what exactly are they, and why should you care?
A Parquet file stores data in columns instead of rows, which makes analytical queries (like "what's the average salary?") dramatically faster and files dramatically smaller.
Imagine a spreadsheet with 1 million employees and 50 columns. In a traditional row-based format like CSV, each record is stored as one line: all 50 values for employee 1, then all 50 values for employee 2, and so on. To answer "what's the average salary?", a reader has to scan the whole file and parse every column of every row just to extract one of them.

In Parquet (columnar storage), all 1 million salary values are stored together, then all 1 million names, and so on. A query that needs only the salary column reads just that column's data and skips the rest of the file entirely.
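The difference can be sketched in plain Python with toy data (this is a conceptual illustration only; real Parquet readers do this at the file level):

```python
# Row-based layout: one record per entry, like a line in a CSV file.
rows = [
    {"name": "Ada", "dept": "Eng", "salary": 120},
    {"name": "Bo",  "dept": "Ops", "salary": 90},
    {"name": "Cy",  "dept": "Eng", "salary": 110},
]

# Column-based layout: one list per column, like Parquet.
columns = {
    "name":   ["Ada", "Bo", "Cy"],
    "dept":   ["Eng", "Ops", "Eng"],
    "salary": [120, 90, 110],
}

# Average salary, row layout: touch every field of every record.
avg_rows = sum(r["salary"] for r in rows) / len(rows)

# Average salary, column layout: touch exactly one list.
salaries = columns["salary"]
avg_cols = sum(salaries) / len(salaries)

assert avg_rows == avg_cols  # same answer, very different I/O pattern
```

With 3 rows the difference is invisible; with 1 million rows and 50 columns, the column layout reads roughly 1/50th of the data for this query.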
| Feature | CSV | Parquet |
|---|---|---|
| Storage layout | Row-based | Column-based |
| File size (typical) | 100 MB | 10-30 MB (compressed) |
| Read speed (single column) | Slow (reads all columns) | Fast (reads only needed columns) |
| Schema | No (just text) | Yes (typed columns with metadata) |
| Human readable | Yes (open in Notepad) | No (binary format) |
| Best for | Small data, quick sharing | Analytics, data pipelines, ML |
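The size gap in the table comes largely from compression: values in one column share a type and often repeat, and similar bytes stored next to each other compress much better. A rough stdlib-only illustration of the effect, using made-up toy data (not a benchmark, and not the actual Parquet encoding):

```python
import zlib

# Toy dataset: 10,000 records with a repetitive, low-cardinality column.
cities = ["New York", "London", "Tokyo"]
records = [(str(i), cities[i % 3], str(50_000 + i % 100)) for i in range(10_000)]

# Row-wise serialization (CSV-like): values from different columns interleave.
row_wise = "\n".join(",".join(r) for r in records).encode()

# Column-wise serialization (Parquet-like): each column's values grouped together.
col_wise = "\n".join(
    ",".join(r[i] for r in records) for i in range(3)
).encode()

# Same content, but the column-grouped bytes compress to fewer bytes
# because long runs of similar values are easy for the codec to exploit.
print(len(zlib.compress(row_wise)), len(zlib.compress(col_wise)))
```

Real Parquet goes further than this sketch: it applies per-column encodings such as dictionary and run-length encoding before a general-purpose codec like Snappy or gzip.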
Three reasons:

1. **Column pruning.** Queries read only the columns they need, so a one-column aggregate skips the other 49 columns entirely.
2. **Better compression.** Each column holds values of a single type, often with many repeats, so per-column encodings (dictionary, run-length) plus codecs like Snappy or gzip shrink the data far more than compressing interleaved text.
3. **A real schema.** Column names, types, and metadata are stored in the file itself, so readers don't have to guess types the way they do with CSV, and engines can use per-chunk statistics to skip data that can't match a filter.
Since Parquet is binary, you can't open it in a text editor. Options:
- Python with pandas:

  ```python
  import pandas
  df = pandas.read_parquet("file.parquet")
  ```

- DuckDB from the command line:

  ```shell
  duckdb -c "SELECT * FROM 'file.parquet' LIMIT 10"
  ```
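If you only want to check whether a file really is Parquet, you don't need any libraries: the format frames its contents with the 4-byte magic number `PAR1` at both the start and the end of the file. A minimal stdlib sketch (the function name is my own, not a standard API):

```python
def looks_like_parquet(path):
    """Check for the 4-byte 'PAR1' magic that frames every Parquet file."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)   # jump to the last 4 bytes of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

This is only a sanity check, not validation; a real reader also parses the footer metadata that sits just before the trailing magic.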
Parquet has become the default file format for modern data work. If you're dealing with datasets larger than a few thousand rows and doing any kind of analysis, switching from CSV to Parquet will typically make your files 3-10x smaller and your queries 10-100x faster. It's one of those rare upgrades that's all upside.