
Arrow Datasets & the MPDataset Interface

MP's Pythonic interface for arrow-backed data products, and (anti)patterns for working with arrow data.

Arrow Datasets (& Delta Tables)

For very large datasets, it is often impossible (or very unwieldy) to store the data as a single parquet file. An arrow dataset can be used to parse the high-level metadata of multiple parquet files, yielding a single entry point to the data they contain. MP's tasks collection in the parsed bucket is an arrow dataset (with Delta on top). MP's python API client, mp-api, has tools to retrieve this data in its entirety and paginate through it in a memory-efficient way:

>>> from mp_api.client import MPRester

>>> with MPRester(<YOUR_API_KEY>) as mpr:
...    tasks = mpr.materials.tasks.search()

The full tasks collection requires >10 GB of on-disk space even in an efficient parquet representation. The client query above will download all tasks to your machine and lazily load them into memory as they are requested.


A list of endpoints with arrow support and mp-api integration can be found at Supported Data Products.

MPDatasets

The return type of the tasks variable in the above snippet is an MPDataset - a thin wrapper around the underlying arrow dataset stored on disk. To preserve the behavior of existing user code, MPDataset objects behave as expected from the return value of any other MPRester query, i.e., like a typical iterable container of Pydantic models or python dictionaries. Indexing, slicing, and looping behave accordingly, but warnings will be raised indicating this is sub-optimal usage:

>>> _ = tasks[0]
<stdin>:1: MPDatasetIndexingWarning:
            Pythonic indexing into arrow-based MPDatasets is sub-optimal, consider using
            idiomatic arrow patterns. See MP's docs on MPDatasets for relevant examples:
            docs.materialsproject.org/materials-project-data-lakehouse/arrow-datasets

A Better Path

Here's a real example retrieving the structure field for all r2SCAN documents with non-zero bandgaps in the tasks dataset (stored on disk as parquet, per the snippet above).

  • Typical comprehension-based filtering (sub-optimal):

  • And filtering using arrow's compute engine (leveraging concepts mentioned here):

The resulting non_metallic_r2scan_structures will be a pyarrow Table, which can be de-serialized back into python objects when the data is actually needed.

PyArrow's syntax and usage patterns may feel unfamiliar at first, but the performance benefits gained from delaying de-serialization (arrow -> python) as long as possible and using arrow's compute engine are well worth the initial learning curve. Consult the PyArrow Cookbook for informative examples of leveraging arrow's strengths.

(De)Serialization Stumbling Blocks


MP's developers are working to resolve this issue so that users won't have to think about it in the future; for now, consider it a work in progress.

Historically, all of MP's data products have been stored in MongoDB collections. The flexibility of MongoDB (and, more generically, JSON) is antithetical to the fully structured format of parquet, so workarounds were needed to serialize certain fields of various MP data products.

An example is the incar field of a CalculationInput, an extremely heterogeneous dictionary: the only feasible option for uniformly, strictly typing this field was to dump the dictionary to a string during serialization and re-hydrate it accordingly.
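As a sketch of that dump-to-string workaround (json.dumps stands in here for whatever dump format emmet-core actually uses; the incar values are invented):

```python
import json

# A heterogeneous incar-style dict: mixed value types defeat any single
# fixed parquet column type.
incar = {"ALGO": "Normal", "EDIFF": 1e-5, "LDAU": True, "MAGMOM": [5.0, 0.6]}

# Workaround: serialize the whole dict as one string column...
serialized = json.dumps(incar)

# ...and re-hydrate it back into a dict only when the field is needed.
rehydrated = json.loads(serialized)
```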

All of the Pydantic document models that define the schemas of MP's data products (available for reference in emmet-core) handle this seamlessly (with some coercion via pydantic), but again, delay de-serialization as long as possible for the best performance. To extend the example above, let's fully hydrate the non_metallic_r2scan_structures filter result as pymatgen structures:

This unfortunately incurs two de-serialization hits, but completes in a reasonable time (on the order of seconds for this example of 42k entries).
