
Legacy JSON Data on AWS Open Data

An introduction to interacting with the JSON-format data hosted in the Materials Project AWS Open Data buckets.


All JSON-format data products stored in the buckets of the Materials Project's OpenData repositories are considered 'legacy' and are candidates for migration to cloud-native data formats. Consult the documentation for MP's active data products for updated usage patterns.

To run the examples in this section, it will help to use a python environment with pandas and s3fs installed.

Structured vs. unstructured data

Much of the raw simulation data that the Materials Project (MP) uses to build up material properties is in plain text format. This is simply because the software used in scientific applications (e.g., VASP, Quantum Espresso, etc.) predates modern data structure standards.

When the extended MP universe began to explore performing high-throughput materials simulations in the late 2000s, it became clear very quickly that interacting with plain text files would not be practical. pymatgen was built around the idea of using JSON to structure raw simulation input and output: all VASP-related objects in pymatgen have to_dict and from_dict methods to make round-tripping from JSON possible. JSON was chosen partly because it is simple and maps directly onto python-native dictionaries, and partly because it forms the document structure of MongoDB, which was selected for orchestrating MP's workflows.
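The round-tripping idea can be sketched with the standard json module. The dictionary below is an invented stand-in for the kind of output a to_dict method might produce, not a real pymatgen serialization:

```python
import json

# Hypothetical dictionary standing in for a pymatgen object's to_dict() output
structure_dict = {
    "@module": "pymatgen.core.structure",
    "@class": "Structure",
    "lattice": {"a": 3.84, "b": 3.84, "c": 3.84},
    "species": ["Si", "Si"],
}

# Serialize to JSON text, then parse it back: the round trip is lossless
serialized = json.dumps(structure_dict)
restored = json.loads(serialized)
assert restored == structure_dict
```

Because JSON objects and python dictionaries have the same shape, a from_dict constructor only needs to consume the parsed dictionary to rebuild the original object.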

While JSON is supported by many languages, it is not well-suited to streaming or partial retrieval of data. JSON Lines (.jsonl) is one attempt to allow for streaming of JSON data: each line of a JSONL file contains a complete JSON object. Much of MP's data up to mid-2025 has been distributed as JSONL for this reason.
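A minimal illustration of the format, with invented records: each line parses independently, so a reader can stream records one at a time without loading the whole file into memory.

```python
import json

# One JSON object per line -- this is essentially the entire JSONL format
jsonl_text = (
    '{"material_id": "mp-1", "band_gap": 0.61}\n'
    '{"material_id": "mp-2", "band_gap": 4.11}\n'
)

# Parse line by line, as a streaming reader would
records = [json.loads(line) for line in jsonl_text.splitlines()]
print(len(records))  # 2
```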

Worked example: JSON

Let's look at the manifest.jsonl file, which contains the high-level metadata of MP's summary data collection. This file is located on MP's OpenData build bucket:

import pandas as pd

# Read the gzipped JSON Lines manifest directly from S3 (requires s3fs)
summary_metadata = pd.read_json(
    "s3://materialsproject-build/collections/2025-09-25/summary/manifest.jsonl.gz",
    lines=True,
)
print(summary_metadata.columns.tolist())
>>> ['band_gap', 'density', 'deprecated', 'e_electronic', 'e_total', 'energy_above_hull', 'formation_energy_per_atom', 'formula_pretty', 'last_updated', 'material_id', 'nelements', 'sourced_from_path', 'symmetry_number', 'task_ids', 'theoretical', 'total_magnetization']

Suppose we wanted to retrieve all materials in MP with a band gap between 0.1 and 1.0 eV and a hull energy less than 0.05 eV/atom. To do this with the manifest.jsonl.gz file, we would need to download the entire file and then filter using the pandas.DataFrame shown above, even though the only columns needed for that filter are band_gap, energy_above_hull, and material_id.
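The filtering step itself is straightforward in pandas. The DataFrame below is a small in-memory stand-in for the manifest, with made-up values; with the real data you would apply the same mask to summary_metadata:

```python
import pandas as pd

# Hypothetical stand-in for the summary manifest DataFrame
df = pd.DataFrame({
    "material_id": ["mp-1", "mp-2", "mp-3"],
    "band_gap": [0.5, 2.0, 0.3],
    "energy_above_hull": [0.0, 0.01, 0.2],
})

# Band gap in [0.1, 1.0] eV AND hull energy below 0.05 eV/atom
mask = df["band_gap"].between(0.1, 1.0) & (df["energy_above_hull"] < 0.05)
matching_ids = df.loc[mask, "material_id"].tolist()
print(matching_ids)  # ['mp-1'] -- mp-2 fails the gap filter, mp-3 the hull filter
```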

Suppose now that we did not have the manifest.jsonl file, and needed to extract the same information from the entire summary collection. We would have to:

1. Iterate over all JSONL files in the summary bucket
2. Download each JSONL file in its entirety
3. Save only the material IDs of those materials matching our filters
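That brute-force scan can be sketched as below. The function is written against generic file paths so it is self-contained; against the real bucket you would list the files with s3fs (for example via S3FileSystem's glob method) and pass its file-opening function as the opener, decompressing .jsonl.gz files with gzip along the way. Field names and thresholds follow the filter above:

```python
import json

def scan_jsonl(paths, opener=open):
    """Stream each JSONL file in full and keep only the matching material IDs."""
    matching_ids = []
    for path in paths:                  # 1. iterate over all JSONL files
        with opener(path, "rt") as fh:  # 2. read each file in its entirety
            for line in fh:
                doc = json.loads(line)
                if (0.1 <= doc["band_gap"] <= 1.0
                        and doc["energy_above_hull"] < 0.05):
                    matching_ids.append(doc["material_id"])  # 3. save matches
    return matching_ids
```

The inefficiency is plain: every record of every file must be downloaded and parsed just to keep three fields from a small fraction of rows, which is exactly the access pattern that cloud-native columnar formats are designed to avoid.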
