AWS OpenData
MP data is also available through the AWS OpenData Program.
To make our data as accessible as possible (in line with the FAIR principles), significantly speed up data downloads, and take pressure off our servers, we are making a growing list of our data products available through the AWS OpenData Program. Also see the entries for MP-managed data on the AWS OpenData Registry or the AWS Data Exchange.
Overview
MP data is organized in 3 buckets named `materialsproject-{raw,parsed,build}`. Note that the particular organization of our data in these buckets is still in flux and can change without notice as we integrate them into our cloud infrastructure.
raw data
We are in the process of providing VASP output files for our calculations in the `raw` bucket. Look out for announcements through our email lists and notifications on our website.
parsed data
The `parsed` bucket contains objects that MP generates by parsing the VASP output files. The objects form the basis for our builder pipelines which create the derived high-level data collections served through the MP API and website. All S3 objects in this bucket are serialized `pymatgen` or `emmet` python objects and most are stored as gzip-compressed JSON files for each MP ID (i.e. `<prefix>/<mp-id>.json.gz`). We are in the process of grouping documents into JSON Lines (JSONL) files to reduce the number of files and significantly improve transfer speeds. `tasks` are now organized by `nelements/output.spacegroup.number` and a timestamp (`dt`) derived from the earliest `completed_at` in the list of tasks included in the respective object.
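As a minimal sketch of working with the per-ID layout described above, the following Python builds an object key of the form `<prefix>/<mp-id>.json.gz` and decodes a gzip-compressed JSON payload. The function names, the prefix, and the sample document are hypothetical and only illustrate the format; the real objects are serialized `pymatgen` or `emmet` documents with their own fields.

```python
import gzip
import json


def object_key(prefix: str, mp_id: str) -> str:
    """Build a key following the <prefix>/<mp-id>.json.gz layout."""
    return f"{prefix.strip('/')}/{mp_id}.json.gz"


def load_json_gz(raw: bytes) -> dict:
    """Decompress a gzip payload and parse it as a single JSON document."""
    return json.loads(gzip.decompress(raw).decode("utf-8"))


# Round trip with a made-up document standing in for a downloaded object.
payload = gzip.compress(
    json.dumps({"task_id": "mp-0000", "nelements": 2}).encode("utf-8")
)
doc = load_json_gz(payload)
```

Here `raw` would be the bytes of an object fetched from the bucket; the decoded dictionary can then be handed to whichever deserializer matches the stored object type.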
prefix | # objects | size |
---|---|---|
| | 691k | 63.1 GB |
| | 705k | 1.4 TB |
| | 400k | 7.2 TB |
| | 138.7k each | 1.1 TB each |
| | 107.5k | 101 GB |
| | 158k | 2.5 TB |
| | 1556 | 34 GB |
build data
The `build` bucket contains the high-level derived data that comprises the source for the collections available through the MP API as well as pre-built objects and images for efficient visualization on the website.
The collections and pre-built objects are versioned by the database release date and individual documents grouped into gzip-compressed JSONL files. Images are stored in PNG format. Use the `ls` command for the AWS CLI or the bucket explorer to list the categories available under each prefix (see download section below).
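Because the documents are grouped into gzip-compressed JSONL files, they can be read one line at a time instead of loading a whole file into memory. The sketch below assumes one JSON document per line and UTF-8 encoding; the function name and the sample documents are hypothetical.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator


def iter_jsonl_gz(stream: BinaryIO) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a gzip-compressed JSONL stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny in-memory file with made-up documents to exercise the reader.
sample_docs = [{"material_id": "mp-1"}, {"material_id": "mp-2"}]
raw = gzip.compress("\n".join(json.dumps(d) for d in sample_docs).encode("utf-8"))
loaded = list(iter_jsonl_gz(io.BytesIO(raw)))
```

For a downloaded file, an opened binary file handle (`open(path, "rb")`) can be passed in place of the in-memory buffer.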
prefix | version | # objects | size |
---|---|---|---|
| | | 12.6k | 2.8 GB |
| | | 18.4k | 6.1 GB |
| | | 289k | 55.9 GB |
| | N/A | 200k | 58 GB |
Explore & Download Data
We are in the process of integrating all available data into the `mp-api` python client for improved convenience and efficiency. However, all data in MP's OpenData buckets can always be downloaded directly using the AWS CLI.
Start by exploring the contents of the bucket you're interested in, by either navigating to the bucket's web interface (e.g. https://materialsproject-parsed.s3.amazonaws.com/index.html) or using the CLI's `ls` command:
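With the AWS CLI, a listing typically looks like `aws s3 ls --no-sign-request s3://materialsproject-parsed/`, where the `--no-sign-request` flag (skip credential signing) is an assumption about how anonymous access is set up. For scripted exploration without the CLI, the following is a minimal Python sketch that talks to the standard S3 `ListObjectsV2` REST endpoint. It assumes the bucket allows anonymous listing over HTTPS; the function names are hypothetical and pagination is not handled.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_url(bucket: str, prefix: str = "") -> str:
    """Build a ListObjectsV2 URL that groups keys by '/' like a directory listing."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "delimiter": "/", "prefix": prefix}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (sub-prefixes, object keys) from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    prefixes = [e.text for e in root.findall("s3:CommonPrefixes/s3:Prefix", S3_NS)]
    keys = [e.text for e in root.findall("s3:Contents/s3:Key", S3_NS)]
    return prefixes, keys


def list_bucket(bucket: str, prefix: str = "") -> tuple[list[str], list[str]]:
    """Fetch and parse one page of results (requires network access)."""
    with urllib.request.urlopen(list_url(bucket, prefix)) as response:
        return parse_listing(response.read().decode("utf-8"))
```

Calling `list_bucket("materialsproject-parsed")` would return the top-level prefixes and any objects stored at the root of the bucket; passing one of the returned prefixes descends one level.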
All objects for a prefix can be downloaded, using the format
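With the AWS CLI, a recursive copy such as `aws s3 cp --recursive --no-sign-request s3://<bucket>/<prefix>/ .` is a common way to fetch everything under a prefix; the flags shown are an assumption about anonymous access. As an alternative for scripts, the sketch below downloads a single object over HTTPS using the same URL pattern as the bucket's web interface (`https://<bucket>.s3.amazonaws.com/<key>`). The function names are hypothetical, and the approach assumes objects are publicly readable.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path


def object_url(bucket: str, key: str) -> str:
    """Build the HTTPS URL of an object, percent-encoding the key but keeping '/'."""
    return f"https://{bucket}.s3.amazonaws.com/{urllib.parse.quote(key, safe='/')}"


def download_object(bucket: str, key: str, dest_dir: str = ".") -> Path:
    """Stream one object into dest_dir, mirroring the key's directory layout.

    Requires network access; raises urllib.error.HTTPError if the object
    does not exist or is not publicly readable.
    """
    target = Path(dest_dir) / key
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(object_url(bucket, key)) as response:
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

To fetch all objects for a prefix, `download_object` can be called once for every key returned by a listing of that prefix. For large prefixes the CLI is likely the better choice, since it handles parallel transfers and retries.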