# Arrow, Parquet, and OTFs

### Apache Arrow / Parquet

{% hint style="info" %}
*A cursory search will turn up extensive literature (engineering blogs, forum posts, docs, etc.) describing the benefits of columnar data formats for cloud-based environments, where efficient data transfer and read throughput have direct implications on cloud spend. For a more technical deep-dive, consult the relevant* [*Arrow*](https://arrow.apache.org/docs/format/Intro.html) *and* [*Parquet*](https://parquet.apache.org/docs/file-format/) *specifications/documentation.*
{% endhint %}

As a user of the Material Project's arrow-backed data products, the most important concepts to understand and keep in mind are *predicate pushdown* and *column pruning*, or in lay-terms *reading only what you need*.

{% hint style="info" %}
*Many parquet query tools support SQL or SQL-like syntax, though programmatic APIs (e.g., PyArrow, Polars) are equally common.*
{% endhint %}

* Predicate Pushdown: A `WHERE`/filter condition (or `$match` if you have a MongoDB-background) that allows your chosen parquet reader to efficiently skip [row groups](https://parquet.apache.org/docs/concepts/). As an example, say I want to query a parquet-based table of [`task`](https://github.com/materialsproject/emmet/blob/324115fa404d37bb7170c95f8570d590fbc98905/emmet-core/emmet/core/tasks.py#L276) documents with calculation `output` s (a struct-type field) that have a `bandgap` greater than 0.1:

```sql
SELECT *
FROM   tasks_table
WHERE  output.bandgap > 0.1
```

* Column Pruning: Refers to simply `SELECT`ing (mongo -> `$project`) which columns you need. Since parquet is a columnar format, rather than reading everything into memory and then selecting the columns you care about, let your chosen parquet reader skip the columns you don't care about. Using the same predicate as before, let's say I only care about the `output` struct's `structure` field:

```sql
SELECT output.structure as structure,
       output.bandgap as bandgap
FROM   tasks_table
WHERE  output.bandgap > 0.1
```

Various examples using actual Materials Project data can be found in [Leveraging External Query Engines](/materials-project-data-lakehouse/leveraging-external-query-engines.md).

### Open Table Formats (OTFs)

Since parquet files are simply just files, the need for a data management abstraction on top of raw files led to the development of several open table formats (OTFs), including Apache Iceberg, Delta Lake, and Apache Hudi. The Materials Project's chosen table format is [Delta](https://delta.io/), but this may change in the future. A more in-depth discussion of OTFs and their benefits can be found [here](https://delta.io/blog/open-table-formats/).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.materialsproject.org/materials-project-data-lakehouse/arrow-parquet-and-otfs.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
