githubEdit

Leveraging External Query Engines

Leaving Python land

An additional benefit of MP's data being publicly accessible in a format like parquet (with Deltaarrow-up-right on top) is the new level of freedom that you as a user now have for interacting with MP's data products and implementing custom pipelines that may not be readily available through the mp-api Python client.

circle-info

Remember to familiarize yourself with the various Terms of Usearrow-up-right and access-restrictions for each of the data products in MP's data lake. Interacting with MP's data at the "bare-metal" level removes all guardrails implemented in the mp-api

DuckDB

DuckDBarrow-up-right is a popular in-process database/query engine with many language bindingsarrow-up-right. DuckDB also has supportarrow-up-right for the Delta Lake format, making DuckDB an ergonomic way of interacting with MP's data lake.

Here is a simple, somewhat contrived, pipeline for retrieving all DOS available in the Materials Project for hexagonal materials that have been calculated with a GGA+U run_type:

circle-info

Query execution speed will be limited by network speeds when interacting with remote data

-- saved as gga_hex.sql
-- Install and load the delta extension
INSTALL delta;
LOAD delta;

-- MP's data is in a public bucket, create
-- simple secret w/ the correct region
CREATE SECRET (
    TYPE s3,
    REGION 'us-east-1'
);

-- Create some views for ease of reference later
CREATE VIEW mp_tasks AS
    SELECT * FROM delta_scan('s3://materialsproject-parsed/core/tasks/');

CREATE VIEW mp_dos AS
    SELECT * FROM delta_scan('s3://materialsproject-parsed/core/electronic-structure/total-dos/');

-- Store all hexagonal GGA+U task_ids
SET VARIABLE gga_hex_task_ids = (
    SELECT list(task_id)
    FROM   (
        SELECT task_id,
               run_type,
               symmetry.crystal_system AS crystal_system
        FROM   mp_tasks
        WHERE  run_type = 'GGA+U'
          AND  crystal_system = 'Hexagonal'
    ) AS hex_gga
);

-- Query DOS table, store results in local parquet file
COPY (
    SELECT *
    FROM   mp_dos
    WHERE  identifier IN getvariable('gga_hex_task_ids')
) TO 'gga_hex_dos.parquet.zstd' (FORMAT parquet, COMPRESSION zstd);

Which can be ran by invoking the DuckDB CLI:

and will return a parquet file with around 3800 DOS (in the formatarrow-up-right MP uses for storing DOS entries).

More Options

Depending on your infrastructure/interests the data products in MP's data lake are compatible with any query engine that speaks Delta Lake, some examples: Sparkarrow-up-right, Trinoarrow-up-right, BigQueryarrow-up-right, Athenaarrow-up-right, Fabricarrow-up-right.

If you develop something cool feel free to reach out and share it with the community!

Last updated

Was this helpful?