Anatomy of a Parquet File

In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages:

Faster query execution when only a subset of columns is being processed
Quick calculation of statistics across all data
Reduced storage volume thanks to efficient compression

When combined with storage frameworks like Delta Lake or Apache Iceberg, it integrates seamlessly with query engines (e.g., Trino) and data warehouse compute clusters (e.g., Snowflake, BigQuery). In this article, the content of a Parquet file is dissected using mainly standard Python tools to better understand its structure and how it contributes to such performance.

Writing Parquet file(s)

To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in columnar format. PyArrow allows fine-grained parameter tuning when writing the file, which makes it ideal for Parquet manipulation (one can also simply use Pandas).

# generator.py

import pyarrow as pa
import pyarrow.parquet as pq
from faker import Faker

fake = Faker()
Faker.seed(12345)
num_records = 100

# Generate fake data
names = [fake.name() for _ in range(num_records)]
addresses = [fake.address().replace("\n", ", ") for _ in range(num_records)]
birth_dates = [
    fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)
]
cities = [addr.split(", ")[1] for addr in addresses]
birth_years = [date.year for date in birth_dates]

# Cast the data to the Arrow format
name_array = pa.array(names, type=pa.string())
address_array = pa.array(addresses, type=pa.string())
birth_date_array = pa.array(birth_dates, type=pa.date32())
city_array = pa.array(cities, type=pa.string())
birth_year_array = pa.array(birth_years, type=pa.int32())

# Create schema with non-nullable fields
schema = pa.schema(
    [
        pa.field("name", pa.string(), nullable=False),
        pa.field("address", pa.string(), nullable=False),
        pa.field("date_of_birth", pa.date32(), nullable=False),
        pa.field("city", pa.string(), nullable=False),
        pa.field("birth_year", pa.int32(), nullable=False),
    ]
)

table = pa.Table.from_arrays(
    [name_array, address_array, birth_date_array, city_array, birth_year_array],
    schema=schema,
)

print(table)

pyarrow.Table
name: string not null
address: string not null
date_of_birth: date32[day] not null
city: string not null
birth_year: int32 not null
----
name: [["Adam Bryan","Jacob Lee","Candice Martinez","Justin Thompson","Heather Rubio"]]
address: [["822 Jennifer Field Suite 507, Anthonyhaven, UT 98088","292 Garcia Mall, Lake Belindafurt, IN 69129","31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323","00716 Kristina Trail Suite 381, Howelltown, SC 64961","351 Christopher Expressway Suite 332, West Edward, CO 68607"]]
date_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]
city: [["Anthonyhaven","Lake Belindafurt","East Tammiestad","Howelltown","West Edward"]]
birth_year: [[1955,1950,1955,1957,1956]]

The output clearly reflects column-oriented storage, unlike Pandas, which usually displays a traditional “row-wise” table.
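To make the difference concrete, here is a minimal sketch in plain Python (no PyArrow involved) that holds the same three records row-wise and column-wise. The helper name to_columns is made up for this illustration, and Python lists only stand in for the binary buffers a real engine would use.

```python
# Illustrative only: plain Python lists stand in for on-disk layouts.
rows = [
    {"name": "Adam Bryan", "city": "Anthonyhaven", "birth_year": 1955},
    {"name": "Jacob Lee", "city": "Lake Belindafurt", "birth_year": 1950},
    {"name": "Candice Martinez", "city": "East Tammiestad", "birth_year": 1955},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited in full, even though one field is needed.
oldest_row_wise = min(record["birth_year"] for record in rows)

# Column layout: only the contiguous "birth_year" values are scanned.
oldest_column_wise = min(columns["birth_year"])

print(columns["birth_year"])
print(oldest_row_wise, oldest_column_wise)
```

Both computations return the same answer; the difference is how much unrelated data has to be touched to get it, which is where the columnar layout pays off on wide tables.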

How is a Parquet file stored?

Parquet files are generally stored in low-cost object stores such as S3 (AWS) or GCS (GCP) so that data processing pipelines can access them easily. These files are usually organized with a partitioning strategy that leverages directory structures:

# generator.py

num_records = 100

# ...

# Write the Parquet files to disk
pq.write_to_dataset(
    table,
    root_path='dataset',
    partition_cols=['birth_year', 'city']
)

If the birth_year and city columns are defined as partitioning keys, PyArrow creates the following tree structure in the dataset directory:

dataset/
├─ birth_year=1949/
├─ birth_year=1950/
│ ├─ city=Aaronbury/
│ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│ │ ├─ …
│ ├─ city=Alicialand/
│ ├─ …
├─ birth_year=1951/
├─ ...

This strategy enables partition pruning: when a query filters on these columns, the engine can use the folder names to read only the necessary files. This is why the partitioning strategy is crucial for limiting latency, I/O, and compute resources when handling large volumes of data (as has been the case for decades with traditional relational databases).

The pruning effect can easily be verified by tracing the files opened by a Python script that filters on the birth year:

# query.py
import duckdb

duckdb.sql(
    """
    SELECT *
    FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
    WHERE birth_year = 1949
    """
).show()

> strace -e trace=open,openat,read -f python query.py 2>&1 | grep "dataset/.*.parquet"

[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3

The trace shows 23 openat calls that touch only 11 distinct files, all under birth_year=1949, out of the 100 files in the dataset.
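The pruning decision itself needs nothing more than the file paths. The sketch below is a simplified illustration in plain Python, not how DuckDB or PyArrow implement it: the file names (part-0.parquet) and the helper names partition_values and prune are assumptions made for the example, and it assumes that special characters in partition values are percent-encoded in directory names.

```python
from urllib.parse import unquote

# Hypothetical file listing, shaped like a Hive-style partitioned dataset.
paths = [
    "dataset/birth_year=1949/city=Clarkemouth/part-0.parquet",
    "dataset/birth_year=1949/city=East%20Morgan/part-0.parquet",
    "dataset/birth_year=1950/city=Aaronbury/part-0.parquet",
    "dataset/birth_year=1951/city=Alicialand/part-0.parquet",
]


def partition_values(path):
    """Extract key=value pairs from the directory names of a path."""
    values = {}
    for segment in path.split("/")[:-1]:  # skip the file name
        if "=" in segment:
            key, _, raw = segment.partition("=")
            values[key] = unquote(raw)  # "%20" -> " "
    return values


def prune(paths, column, wanted):
    """Keep only the files whose partition value matches; no file is opened."""
    return [p for p in paths if partition_values(p).get(column) == wanted]


selected = prune(paths, "birth_year", "1949")
print(len(selected), "of", len(paths), "files need to be read")
for p in selected:
    print(partition_values(p))
```

Partition values extracted this way are strings, so a real engine also has to cast them to the column type before comparing them with the filter.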

Reading a raw Parquet file

Let’s decode a raw Parquet file without specialized libraries. For simplicity, the dataset is dumped into a single file without compression or encoding.

# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    use_dictionary=False,
    compression="NONE",
    write_statistics=True,
    column_encoding=None,
)

The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is “PAR1”: they appear at both the very beginning and the very end of the file. The file is corrupted if this is not the case.

# reader.py

with open("dataset.parquet", "rb") as file:
    parquet_data = file.read()

assert parquet_data[:4] == b"PAR1", "Not a valid parquet file"
assert parquet_data[-4:] == b"PAR1", "File footer is corrupted"

As indicated in the documentation, the file is divided into two parts: the “row groups” containing the actual data, and the footer containing metadata (schema below).

The footer

The size of the footer is stored in the 4 bytes preceding the end marker, as an unsigned integer written in “little endian” format (noted “<I” in the struct.unpack function).

# reader.py

import struct

# ...

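footer_length = struct.unpack("<I", parquet_data[-8:-4])[0]
footer_data = parquet_data[-(footer_length + 8) : -8]
print(f"Footer size in bytes: {footer_length}")

Footer size in bytes: 1088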
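The framing can be tried out without a real Parquet file. The following sketch builds a toy byte string with the same envelope (magic bytes, body, footer, footer length, magic bytes) and splits it again. The payloads are placeholders and the function names frame and split_footer are invented for this example; a real footer contains Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"


def frame(body: bytes, footer: bytes) -> bytes:
    """Assemble magic + body + footer + footer length + magic (toy example)."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def split_footer(data: bytes):
    """Return (body, footer) using only the trailing length field."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_length
    return data[4:footer_start], data[footer_start:-8]


# Placeholder payloads: real files hold row groups and Thrift-encoded metadata.
blob = frame(b"row-group-bytes", b"fake-footer")
body, footer = split_footer(blob)
print(body, footer)
```

Storing the length at the end lets a writer stream the row groups first and append the metadata once everything else is known, while a reader only needs to look at the tail of the file to find the footer.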

The footer information is encoded in a cross-language serialization format called Apache Thrift. Using a human-readable but verbose format like JSON and then translating it into binary would be less efficient in terms of memory usage. With Thrift, one can declare data structures as follows:

struct Customer {
    1: required string name,
    2: optional i16 birthYear,
    3: optional list<string> interests
}

On the basis of this declaration, Thrift can generate Python code to decode byte strings with such a data structure (it also generates code to perform the encoding part). The Thrift file containing all the data structures implemented in a Parquet file (parquet.thrift) is published in the Apache parquet-format repository. After having installed the thrift binary, let’s run:

thrift -r --gen py parquet.thrift

The generated Python code is located in the “gen-py” folder. The footer’s data structure is represented by the FileMetaData class, a Python class automatically generated from the Thrift schema. Using Thrift’s Python utilities, the binary data is parsed and populated into an instance of this FileMetaData class.

# reader.py

import sys

# ...

# Add the generated classes to the python path
sys.path.append("gen-py")
from parquet.ttypes import FileMetaData, PageHeader
from thrift.transport import TTransport
from thrift.protocol import TCompactProtocol

def read_thrift(data, thrift_instance):
    """
    Read a Thrift object from a binary buffer.
    Returns the Thrift object and the number of bytes read.
    """
    transport = TTransport.TMemoryBuffer(data)
    protocol = TCompactProtocol.TCompactProtocol(transport)
    thrift_instance.read(protocol)
    return thrift_instance, transport._buffer.tell()

# The number of bytes read is not used for now
file_metadata_thrift, _ = read_thrift(footer_data, FileMetaData())

print(f"Number of rows in the whole file: {file_metadata_thrift.num_rows}")
print(f"Number of row groups: {len(file_metadata_thrift.row_groups)}")

Number of rows in the whole file: 100
Number of row groups: 1
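Part of what keeps such metadata small is the way Thrift’s compact protocol writes integers: as variable-length integers (varints), with signed values first mapped through a “zigzag” transformation. The sketch below covers only these two building blocks, not the protocol as a whole, and the function names are illustrative rather than taken from the Thrift library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag_encode(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


encoded = encode_varint(zigzag_encode(100))  # e.g. a row count of 100
print(encoded.hex(), len(encoded))
value, _ = decode_varint(encoded)
print(zigzag_decode(value))
```

A value such as 100 therefore takes two bytes instead of the eight a fixed-width 64-bit integer would need, and far fewer than its JSON representation with a field name.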

The footer contains extensive information about the file’s structure and content. For instance, it accurately tracks the number of rows in the generated dataframe. These rows are all contained within a single “row group”. But what is a “row group”?

Row groups

Unlike purely column-oriented formats, Parquet employs a hybrid approach. Before writing column blocks, the dataframe is first partitioned horizontally, by rows, into row groups (the Parquet file we generated is too small to be split into several row groups).

This hybrid structure offers several advantages:

Parquet calculates statistics (such as min/max values) for each column within each row group. These statistics are crucial for query optimization, allowing query engines to skip entire row groups that cannot match the filtering criteria. For example, if a query filters for birth_year > 1955 and a row group’s maximum birth year is 1954, the engine can skip that entire section of data. This optimization is called “predicate pushdown”. Parquet also stores other useful statistics like distinct value counts and null counts.

# reader.py

# ...

first_row_group = file_metadata_thrift.row_groups[0]
birth_year_column = first_row_group.columns[4]

min_stat_bytes = birth_year_column.meta_data.statistics.min
max_stat_bytes = birth_year_column.meta_data.statistics.max

min_year = struct.unpack("<i", min_stat_bytes)[0]
max_year = struct.unpack("<i", max_stat_bytes)[0]
print(f"The birth year range is between {min_year} and {max_year}")

The birth year range is between 1949 and 1958
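Once the min/max bytes are decoded, the skip decision of predicate pushdown is a simple comparison. The sketch below uses made-up statistics for three row groups; RowGroupStats and may_contain_greater_than are hypothetical names for this illustration, not part of the generated Thrift classes, and real engines handle many more predicate types.

```python
import struct
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the per-row-group statistics of one column."""

    min_bytes: bytes  # little-endian int32, as stored in the footer
    max_bytes: bytes


def int32(raw: bytes) -> int:
    """Decode a 4-byte little-endian signed integer."""
    return struct.unpack("<i", raw)[0]


def may_contain_greater_than(stats: RowGroupStats, threshold: int) -> bool:
    """A row group can match `column > threshold` only if its max exceeds it."""
    return int32(stats.max_bytes) > threshold


# Made-up statistics for three row groups of a birth_year column.
row_groups = [
    RowGroupStats(struct.pack("<i", 1949), struct.pack("<i", 1954)),
    RowGroupStats(struct.pack("<i", 1952), struct.pack("<i", 1958)),
    RowGroupStats(struct.pack("<i", 1956), struct.pack("<i", 1958)),
]

to_scan = [i for i, rg in enumerate(row_groups) if may_contain_greater_than(rg, 1955)]
print("Row groups to scan:", to_scan)
```

The first row group is skipped without reading a single value. The statistics can only rule row groups out: the second one still has to be scanned and filtered row by row, since it mixes matching and non-matching years.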

Row groups enable parallel processing of data (particularly valuable for frameworks like Apache Spark). The size of these row groups can be configured based on the computing resources available (using the row_group_size parameter of the write_table function when using PyArrow).

# generator.py

# ...

pq.write_table(
    table,
    "dataset.parquet",
    row_group_size=100,
)

# /!\ Keep the default value of "row_group_size" for the next parts

Even if this is not the primary goal of a columnar format, Parquet’s hybrid structure maintains reasonable performance when reconstructing full rows. Without row groups, rebuilding an entire row might require scanning the entirety of each column, which would be extremely inefficient for large files.
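The hybrid layout can be mimicked with plain Python lists. In this simplified sketch, build_row_groups and read_row are invented names, and fixed-size row groups are assumed so that a row index maps directly to its group; an actual reader locates row groups through the footer metadata instead.

```python
def build_row_groups(columns, row_group_size):
    """Split column lists into row groups, each holding one chunk per column."""
    num_rows = len(next(iter(columns.values())))
    groups = []
    for start in range(0, num_rows, row_group_size):
        stop = start + row_group_size
        groups.append({name: values[start:stop] for name, values in columns.items()})
    return groups


def read_row(groups, row_group_size, index):
    """Rebuild one full row by touching a single row group."""
    group = groups[index // row_group_size]
    offset = index % row_group_size
    return {name: chunk[offset] for name, chunk in group.items()}


columns = {
    "name": ["Adam Bryan", "Jacob Lee", "Candice Martinez", "Justin Thompson", "Heather Rubio"],
    "birth_year": [1955, 1950, 1955, 1957, 1956],
}

groups = build_row_groups(columns, row_group_size=2)
print(len(groups), "row groups")
print(read_row(groups, 2, 3))
```

Each row group remains columnar internally, so column scans stay efficient, while a single-row lookup only has to visit the chunks of one group.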

Data Pages

The smallest substructure of a Parquet file is the page. It contains a sequence of values from the same column and, therefore, of the same type. The choice of page size is the result of a trade-off:

Larger pages mean less metadata to store and read, which is optimal for queries with minimal filtering.

Smaller pages reduce the amount of unnecessary data read, which is better when queries target small, scattered data ranges.

Now let’s decode the contents of the first page of the address column. Its location is found in the footer, given by the data_page_offset attribute of the corresponding ColumnMetaData. The offset does not point directly at the values: it points to the Thrift binary representation of a PageHeader object, which holds the page metadata and precedes the page itself. The PageHeader class can also be found in the gen-py directory.

💡 Between the PageHeader and the actual values contained within the page, there may be a few bytes dedicated to implementing the Dremel format, which allows encoding nested data structures. Since our data has a regular tabular format and the values are not nullable, these bytes are skipped when writing the file (https://parquet.apache.org/docs/file-format/data-pages/).

# reader.py

# ...

address_column = first_row_group.columns[1]
column_start = address_column.meta_data.data_page_offset
column_end = column_start + address_column.meta_data.total_compressed_size
column_content = parquet_data[column_start:column_end]

# The page header comes first; the values start right after it
page_thrift, page_header_size = read_thrift(column_content, PageHeader())
page_content = column_content[
    page_header_size : (page_header_size + page_thrift.compressed_page_size)
]
print(page_content[:100])

b'6\x00\x00\x00481 Mata Squares Suite 260, Lake Rachelville, KY 874642\x00\x00\x00671 Barker Crossing Suite 390, Mooreto'

The generated values finally appear, in plain text and not encoded (as specified when writing the Parquet file). However, to optimize the columnar format, it is recommended to use one of the following encoding algorithms: dictionary encoding, run length encoding (RLE), or delta encoding (the latter being reserved for int32 and int64 types), followed by compression using gzip or snappy (the available codecs are listed in the Parquet format documentation). Since encoded pages contain similar values (all addresses, all decimal numbers, etc.), compression ratios can be particularly advantageous.

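As documented in the specification, when character strings (BYTE_ARRAY) are not encoded, each value is preceded by its size represented as a 4-byte little-endian integer. This can be observed in the previous output: the first four bytes (the character 6, whose ASCII code is 0x36 = 54, followed by three null bytes) encode the number 54, which is exactly the length of the first address.

To read the values (for example, the first 10), the loop is rather simple: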

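idx = 0
for _ in range(10):
    # Each value is prefixed by its length as a 4-byte little-endian integer
    str_size = struct.unpack("<I", page_content[idx : idx + 4])[0]
    print(page_content[idx + 4 : idx + 4 + str_size].decode())
    idx += 4 + str_size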

481 Mata Squares Suite 260, Lake Rachelville, KY 87464
671 Barker Crossing Suite 390, Mooretown, MI 21488
62459 Jordan Knoll Apt. 970, Emilyfort, DC 80068
948 Victor Sq. Apt. 753, Braybury, RI 67113
365 Edward Place Apt. 162, Calebborough, AL 13037
894 Reed Lock, New Davidmouth, NV 84612
24082 Allison Squares Suite 345, North Sharonberg, WY 97642
00266 Johnson Drives, South Lori, MI 98513
15255 Kelly Plains, Richardmouth, GA 33438
260 Thomas Glens, Port Gabriela, OH 96758
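For comparison with this plain layout, the idea behind dictionary encoding combined with run length encoding can be sketched in a few lines. This is a conceptual illustration only: Parquet’s actual on-disk encodings use a bit-packed hybrid format that is not reproduced here, the state codes are made up, and the function names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value by an index into a list of distinct values."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, run length] pairs back into a flat list."""
    return [value for value, count in runs for _ in range(count)]


# Made-up state codes, sorted so that repeats are adjacent.
states = ["KY", "KY", "KY", "MI", "MI", "DC", "DC", "DC", "DC"]

dictionary, indices = dictionary_encode(states)
runs = run_length_encode(indices)
print(dictionary)
print(runs)

# Round trip back to the original column.
restored = [dictionary[i] for i in run_length_decode(runs)]
```

Nine strings shrink to a three-entry dictionary and three runs. The gain depends heavily on the data: a column of mostly unique values, such as the addresses above, benefits little from either technique.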

And there we have it! We have successfully recreated, in a very simple way, how a specialized library would read a Parquet file. By understanding its building blocks, including headers, footers, row groups, and data pages, we can better appreciate how features like predicate pushdown and partition pruning deliver such impressive performance benefits in data-intensive environments. I am convinced that understanding how Parquet works under the hood helps in making better decisions about storage strategies, compression choices, and performance optimization.

All the code used in this article is available on my GitHub repository at https://github.com/kili-mandjaro/anatomy-parquet, where you can explore more examples and experiment with different Parquet file configurations.

Whether you are building data pipelines, optimizing query performance, or simply curious about data storage formats, I hope this deep dive into Parquet’s inner structures has provided valuable insights for your Data Engineering journey.

All images are by the author.
