Arrow, Parquet, and OTFs

Apache Arrow / Parquet

A cursory search will turn up extensive literature (engineering blogs, forum posts, docs, etc.) describing the benefits of columnar data formats in cloud-based environments, where efficient data transfer and read throughput have direct implications for cloud spend. For a more technical deep-dive, consult the relevant Arrow and Parquet specifications and documentation.

As a user of the Materials Project's Arrow-backed data products, the most important concepts to understand and keep in mind are predicate pushdown and column pruning, or in lay terms: reading only what you need.

Many Parquet query tools support SQL or SQL-like syntax, though programmatic APIs (e.g., PyArrow, Polars) are equally common.

Predicate Pushdown: A WHERE/filter condition (or $match, if you have a MongoDB background) that is handed to your chosen Parquet reader, which can then use the min/max statistics stored in the file metadata to skip row groups that cannot contain matching rows. As an example, say I want to query a Parquet-based table of task documents for calculation outputs (a struct-type field) that have a bandgap greater than 0.1:

SELECT *
FROM   tasks_table
WHERE  output.bandgap > 0.1
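
To make the row-group skipping concrete, here is a minimal toy sketch in plain Python. It is not how any particular Parquet reader is implemented; the RowGroup class, the scan_bandgap_gt function, and the bandgap values are all hypothetical stand-ins. It only illustrates the idea that per-row-group min/max statistics let a reader rule out whole row groups before decoding them.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group of bandgap values."""
    bandgaps: list
    # Statistics are computed once at "write" time, like Parquet footer metadata.
    min_bandgap: float = field(init=False)
    max_bandgap: float = field(init=False)

    def __post_init__(self):
        self.min_bandgap = min(self.bandgaps)
        self.max_bandgap = max(self.bandgaps)


def scan_bandgap_gt(row_groups, threshold):
    """Return (matching values, number of row groups actually read)."""
    matches, groups_read = [], 0
    for group in row_groups:
        # If even the largest value fails the predicate, no row can match:
        # skip the whole group using only its statistics.
        if group.max_bandgap <= threshold:
            continue
        groups_read += 1
        # Rows in a surviving group still need to be filtered one by one.
        matches.extend(v for v in group.bandgaps if v > threshold)
    return matches, groups_read


# Made-up data: three row groups, only one of which can satisfy bandgap > 0.1.
row_groups = [
    RowGroup([0.0, 0.0, 0.05]),   # max 0.05 -> skipped
    RowGroup([0.0, 1.2, 3.4]),    # max 3.4  -> read and filtered
    RowGroup([0.02, 0.1, 0.08]),  # max 0.1  -> skipped (0.1 is not > 0.1)
]
matches, groups_read = scan_bandgap_gt(row_groups, 0.1)
print(matches, groups_read)
```

In this toy data only one of the three row groups is read. How much a real reader can skip depends on how the data was sorted and partitioned when it was written, since that determines how tight each row group's min/max range is.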

Column Pruning: SELECTing (in MongoDB terms, $project) only the columns you need. Since Parquet is a columnar format, your chosen Parquet reader can skip the columns you don't care about, rather than reading everything into memory and selecting columns afterward. Using the same predicate as before, let's say I only care about the output struct's structure and bandgap fields:

SELECT output.structure as structure,
       output.bandgap as bandgap
FROM   tasks_table
WHERE  output.bandgap > 0.1
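
The query above can be mimicked with another toy sketch in plain Python, again under stated assumptions: the column_chunks layout, the read_columns and select_structures helpers, and every value in them (including the task_id and output.energy columns) are hypothetical and invented for illustration. The point is only that a columnar layout stores each column separately, so a reader can decode the requested columns and never touch the rest.

```python
# Toy columnar "file": each column lives in its own chunk, as in Parquet.
# Column names and values are placeholders, not real Materials Project data.
column_chunks = {
    "task_id": ["t-1", "t-2", "t-3"],
    "output.bandgap": [0.0, 1.2, 3.4],
    "output.structure": ["s1", "s2", "s3"],
    "output.energy": [-1.0, -2.0, -3.0],
}


def read_columns(chunks, wanted):
    """Decode only the requested column chunks and report which were touched."""
    table, touched = {}, []
    for name in wanted:
        table[name] = list(chunks[name])  # stands in for decoding one chunk
        touched.append(name)
    return table, touched


def select_structures(chunks, threshold):
    """Mimic: SELECT structure, bandgap ... WHERE output.bandgap > threshold."""
    table, touched = read_columns(chunks, ["output.structure", "output.bandgap"])
    rows = [
        {"structure": s, "bandgap": b}
        for s, b in zip(table["output.structure"], table["output.bandgap"])
        if b > threshold
    ]
    return rows, touched


rows, touched = select_structures(column_chunks, 0.1)
print(rows)
print("columns read:", touched)
```

Here only two of the four column chunks are decoded. With SELECT *, every chunk would be read, which matters most when a table has wide or deeply nested columns.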

Various examples using actual Materials Project data can be found in Leveraging External Query Engines.

Open Table Formats (OTFs)

Because Parquet files are just files, the need for a data management abstraction on top of raw files led to the development of several open table formats (OTFs), including Apache Iceberg, Delta Lake, and Apache Hudi. The Materials Project's chosen table format is Delta Lake, but this may change in the future. A more in-depth discussion of OTFs and their benefits can be found here.


