
Arrow, Parquet, and OTFs

Apache Arrow / Parquet


A cursory search will turn up extensive literature (engineering blogs, forum posts, docs, etc.) describing the benefits of columnar data formats for cloud-based environments, where efficient data transfer and read throughput have direct implications for cloud spend. For a more technical deep dive, consult the relevant Arrow and Parquet specifications/documentation.

As a user of the Materials Project's Arrow-backed data products, the most important concepts to understand and keep in mind are predicate pushdown and column pruning or, in lay terms, reading only what you need.


Many Parquet query tools support SQL or SQL-like syntax, though programmatic APIs (e.g., PyArrow, Polars) are equally common.

  • Predicate Pushdown: A WHERE/filter condition (or $match, if you have a MongoDB background) that allows your chosen Parquet reader to efficiently skip row groups that cannot contain matching rows. As an example, say I want to query a Parquet-based table of task documents for calculation outputs (a struct-type field) that have a bandgap greater than 0.1:

SELECT *
FROM   tasks_table
WHERE  output.bandgap > 0.1
  • Column Pruning: Refers to simply SELECTing (in MongoDB terms, $project) only the columns you need. Since Parquet is a columnar format, rather than reading everything into memory and then selecting the columns you care about, let your chosen Parquet reader skip the columns you don't care about. Using the same predicate as before, let's say I only care about the output struct's structure and bandgap fields:

SELECT output.structure as structure,
       output.bandgap as bandgap
FROM   tasks_table
WHERE  output.bandgap > 0.1
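
To make the mechanics behind those two queries concrete, here is a toy, pure-Python model of what a Parquet reader does. It is a sketch under stated assumptions, not the API of PyArrow or any other real reader: the names make_row_group and scan are hypothetical, the values are made up for illustration, and the output struct is flattened into top-level bandgap and structure columns to keep things short. The idea it illustrates is real, though: Parquet stores min/max statistics per column for each row group, so a reader can discard whole row groups without reading their data.

```python
def make_row_group(columns):
    """Bundle column chunks with min/max statistics for the numeric columns."""
    stats = {
        name: (min(values), max(values))
        for name, values in columns.items()
        if all(isinstance(v, (int, float)) for v in values)
    }
    return {"columns": columns, "stats": stats}


def scan(row_groups, select, filter_column, threshold):
    """Return rows where filter_column > threshold, reading only `select` columns."""
    rows, groups_read = [], 0
    for group in row_groups:
        _, col_max = group["stats"][filter_column]
        if col_max <= threshold:
            continue  # predicate pushdown: no row in this group can match
        groups_read += 1
        # column pruning: only the filter column and the selected columns are touched
        mask = [value > threshold for value in group["columns"][filter_column]]
        chunks = [group["columns"][name] for name in select]
        for i, keep in enumerate(mask):
            if keep:
                rows.append({name: chunk[i] for name, chunk in zip(select, chunks)})
    return rows, groups_read


# Made-up data: three row groups of two rows each.
tasks_table = [
    make_row_group({"task_id": ["t-1", "t-2"], "bandgap": [0.0, 0.05], "structure": ["Cu", "Al"]}),
    make_row_group({"task_id": ["t-3", "t-4"], "bandgap": [0.0, 1.1], "structure": ["Fe", "Si"]}),
    make_row_group({"task_id": ["t-5", "t-6"], "bandgap": [3.2, 5.5], "structure": ["TiO2", "C"]}),
]

rows, groups_read = scan(
    tasks_table, select=["structure", "bandgap"], filter_column="bandgap", threshold=0.1
)
print(rows)
print(groups_read)  # 2: the first row group is skipped using its statistics alone
```

The first row group is never read because its maximum bandgap (0.05) rules out any match, and the task_id column is never read from any group because it was not selected. Real readers do the same with far larger row groups, which is where the savings in transfer and memory come from.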

Various examples using actual Materials Project data can be found in Leveraging External Query Engines.

Open Table Formats (OTFs)

Since Parquet files are just files, the need for a data management abstraction on top of raw files led to the development of several open table formats (OTFs), including Apache Iceberg, Delta Lake, and Apache Hudi. The Materials Project's chosen table format is Delta, but this may change in the future. A more in-depth discussion of OTFs and their benefits can be found here.
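
As a rough illustration of what such an abstraction adds, the sketch below mimics the core idea of a Delta-style transaction log: a `_delta_log` directory of ordered JSON commits whose add and remove actions determine which Parquet files currently make up the table. This is a heavily simplified toy, not a Delta reader. The helper active_files and the file names are hypothetical, and real logs carry additional actions (metadata, protocol, checkpoints) that are ignored here. In practice a Delta-aware library or query engine handles all of this for you.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_dir):
    """Replay a simplified Delta-style log and return the table's current data files."""
    files = set()
    log_dir = Path(table_dir) / "_delta_log"
    # Commit files are named by zero-padded version, so sorting by name gives commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


# Build a hypothetical two-commit table log in a temporary directory.
with tempfile.TemporaryDirectory() as table_dir:
    log_dir = Path(table_dir) / "_delta_log"
    log_dir.mkdir()
    commits = [
        [{"add": {"path": "part-0.parquet"}}, {"add": {"path": "part-1.parquet"}}],
        # The second commit replaces part-0 with a rewritten file, part-2.
        [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-2.parquet"}}],
    ]
    for version, actions in enumerate(commits):
        commit_file = log_dir / f"{version:020d}.json"
        commit_file.write_text("\n".join(json.dumps(action) for action in actions))

    current = active_files(table_dir)

print(current)  # ['part-1.parquet', 'part-2.parquet']
```

The point of the sketch is that the table is defined by the log, not by whatever files happen to sit in the directory: part-0.parquet may still exist on disk, but a reader that follows the log will no longer treat it as part of the table.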
