Task Collection Migrations

DISCLAIMER: Make backups or copies of any affected collections/databases before applying destructive operations.

Given the wide range of possible outputs from DFT workflows, this set of migrations should not be considered exhaustive.

These migrations are the steps that were taken by the Materials Project staff to migrate MP's core task document collection.

The schema of the task documents generated during calculation parsing has changed over time as the underlying workflows and workflow management software have evolved. Below are the migrations that were required for migrating the Materials Project's 10+ year old core task collection from the TaskDocument produced by atomate's VaspDrone, to emmet-core's TaskDoc (used by atomate2), and finally to the minified emmet-core CoreTaskDoc (the schema of documents returned by the tasks endpoint of the mp-api client).

Server-Side Migrations

A few simple field renames and type coercions were necessary during the transition from TaskDocument to TaskDoc. They require no client-side data manipulation and can be safely applied before migrating to TaskDoc to make the process smoother. They can also be applied to an existing TaskDoc collection.

  • Migration of Potcar Symbols for orig_inputs

    • During the process of updating the CalculationInput class in emmet-core, a number of inconsistencies were found for the potcar field of orig_inputs (ref: emmet PR comment). These updates will address those inconsistencies:

from pymongo import UpdateMany

ops = [
    # Migrate any erroneously parsed emmet.core.vasp.calculation.PotcarSpec's
    # to the correct location
    UpdateMany(
        {
            "$and": [
                {"orig_inputs.potcar": {"$type": "array"}},
                {"orig_inputs.potcar.0": {"$type": "object"}},
            ]
        },
        {
            "$rename": {"orig_inputs.potcar": "orig_inputs.potcar_spec"},
        },
    ),
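    # Migrate emmet.core.Potcar (deprecated) struct -> list of potcar symbols.
    # The update is an aggregation pipeline (a list) so that "$set" can copy
    # the value of the existing "symbols" sub-field.
    UpdateMany(
        {"orig_inputs.potcar": {"$type": "object"}},
        [{"$set": {"orig_inputs.potcar": "$orig_inputs.potcar.symbols"}}],
    ),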
]
  • Migration of has_vasp_completed

    • The transition from atomate to atomate2 resulted in the has_vasp_completed field changing from a boolean to an enumeration. These operations will correct any mis-typed fields:
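A minimal sketch of the boolean-to-enumeration coercion described above, written as plain dictionaries so it can be inspected without a database connection. The field path calcs_reversed.has_vasp_completed and the enum strings "successful" and "failed" are assumptions (modeled on emmet-core's TaskState); check both against the emmet-core version in use before applying anything.

```python
# Assumed mapping from the legacy boolean to the enum's string value.
BOOL_TO_STATE = {True: "successful", False: "failed"}
FIELD = "has_vasp_completed"


def has_vasp_completed_specs(array_field="calcs_reversed"):
    """Build one update spec per boolean value.

    array_field is an assumption: the array whose entries carry the flag.
    """
    specs = []
    for flag, state in BOOL_TO_STATE.items():
        specs.append(
            {
                # Match documents with at least one entry still storing the boolean
                "filter": {f"{array_field}.{FIELD}": flag},
                # Rewrite only the array entries selected by the array filter
                "update": {"$set": {f"{array_field}.$[calc].{FIELD}": state}},
                "array_filters": [{f"calc.{FIELD}": flag}],
            }
        )
    return specs


specs = has_vasp_completed_specs()
```

Each spec maps onto the parameters of pymongo's UpdateMany, for example UpdateMany(s["filter"], s["update"], array_filters=s["array_filters"]).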

  • Removal of dropped input and orig_inputs fields

    • A number of fields were dropped in favor of exposing these values as properties (@property) on the underlying emmet.core.vasp.Calculation class (emmet-core #1226). These fields can be safely removed:
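The removal can be sketched as a single "$unset" update. The helper below is hypothetical, and the field names passed to it are placeholders only; the actual list of dropped fields should be taken from the referenced emmet-core change. The parent keys input and orig_inputs are assumed to sit at the top level of each document.

```python
def build_unset_spec(dropped_fields, parents=("input", "orig_inputs")):
    """Build a filter/update pair that removes dropped_fields under each parent."""
    paths = [f"{parent}.{name}" for parent in parents for name in dropped_fields]
    return {
        # Only touch documents that still carry at least one of the fields
        "filter": {"$or": [{path: {"$exists": True}} for path in paths]},
        # "$unset" ignores the value; an empty string is the conventional filler
        "update": {"$unset": {path: "" for path in paths}},
    }


# Placeholder names for illustration only, not the real dropped fields.
spec = build_unset_spec(["example_field_a", "example_field_b"])
```

The resulting pair can be passed to pymongo as UpdateMany(spec["filter"], spec["update"]).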

The preceding operations can be executed individually, or concatenated and executed all at once. Consult the MongoDB bulk write documentation for examples of executing bulk writes.
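As a rough illustration of the "all at once" option, the hypothetical helper below concatenates several lists of operations and submits them in one bulk_write call. It assumes collection is a pymongo Collection (or any object exposing a compatible bulk_write method); the helper name and its arguments are not part of the original migration scripts.

```python
def run_migrations(collection, *op_groups):
    """Concatenate groups of write operations and submit them in one call."""
    all_ops = [op for group in op_groups for op in group]
    if not all_ops:
        # pymongo rejects an empty list of requests, so skip the call entirely
        return None
    # ordered=True (pymongo's default) applies the operations in list order
    # and stops at the first failure, which matters when one operation
    # depends on an earlier rename.
    return collection.bulk_write(all_ops, ordered=True)
```

For example, run_migrations(db["tasks"], potcar_ops, completed_ops), where the collection name and the two operation lists are placeholders.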

Client-Side Migrations

The following migrations involve more complicated manipulations and can have long run times depending on the size of the source tasks collection.

  • Migration of TaskDocument to TaskDoc

    • The following create_new_taskdoc function can be used to transform a TaskDocument into a document with the TaskDoc schema. Coordination of database operations in this situation is highly dependent on the execution environment and is thus left to the user.

Migration of TaskDoc to CoreTaskDoc

This is a multi-step process; the steps should be performed in order.

  1. Flattening calcs_reversed

    • Prior to atomate2, workflows would write multiple calculations into a single directory, which created the need for a field like calcs_reversed so that several calculations in one directory could be parsed into a single task document. atomate2 has done away with this: even in complex, multi-step workflows, each individual calculation now has its own output directory. Extracting the individual calculations from calcs_reversed is straightforward:

If the source_collection -> target_collection pattern is followed, be sure to also copy every document in source_collection with len(calcs_reversed) == 1, setting its calcs_reversed field to calcs_reversed[0].
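The flattening in step 1, including the single-calculation case, can be sketched on plain dictionaries as follows. This is an illustration under assumptions rather than the original migration code: all other top-level fields (including any identifier such as task_id) are copied unchanged, and assigning new identifiers to the extracted documents is left to the user.

```python
from copy import deepcopy


def flatten_calcs_reversed(task_doc):
    """Return one copy of task_doc per entry in its calcs_reversed list.

    In each copy, calcs_reversed holds a single calculation (a dict)
    instead of a list of calculations.
    """
    calcs = task_doc.get("calcs_reversed") or []
    flattened = []
    for calc in calcs:
        # Copy everything except the original list of calculations
        new_doc = {
            key: deepcopy(value)
            for key, value in task_doc.items()
            if key != "calcs_reversed"
        }
        new_doc["calcs_reversed"] = deepcopy(calc)
        flattened.append(new_doc)
    return flattened
```

A document with a single calculation yields one flattened document, which matches the calcs_reversed[0] rule above.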

  2. "Parsing" extracted entries from calcs_reversed

    • The calcs_reversed entries from step 1 will need to be transformed into their own TaskDoc documents:

  3. Flattened "TaskDoc" to CoreTaskDoc

    • Another core difference in CoreTaskDoc is the removal of the calculation's "trajectory" (energies/forces tracked across the ionic steps in the calculation: ionic_steps) from the database entries to be stored externally. The size of the ionic_steps field can vary drastically across calculations and was found to be a major contributor to the on-disk storage size of MP's core task collection. See a full discussion here: emmet PR #1232.

    • This snippet uses pyarrow to store the trajectories as parquet files as part of a pyarrow Dataset. Alternative storage formats can be substituted as well.
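The separation step itself, independent of the storage format, can be sketched on plain dictionaries. The sketch below does not use pyarrow and is not the original snippet. It assumes a flattened document whose calcs_reversed field is a single calculation, that ionic_steps lives under that calculation's output, and that task_id identifies the document; all three should be checked against the actual data.

```python
from copy import deepcopy


def split_trajectory(flat_doc, calc_key="calcs_reversed"):
    """Separate ionic_steps from a flattened task document.

    Returns (core_doc, trajectory_record); the input is left unmodified.
    """
    core_doc = deepcopy(flat_doc)
    calc = core_doc.get(calc_key) or {}
    output = calc.get("output") or {}
    # Remove the per-ionic-step data from the copy kept in the database
    ionic_steps = output.pop("ionic_steps", None) or []
    trajectory_record = {
        "task_id": core_doc.get("task_id"),  # assumed identifier key
        "ionic_steps": ionic_steps,
    }
    return core_doc, trajectory_record
```

The trajectory records can then be written to external storage in whichever format is chosen, keyed by the same identifier as the slimmed-down document.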

A dedicated method is available in emmet-core for constructing a list of Trajectory objects from a list of Calculation objects: get_trajectories_from_calculations. Depending on the input data, this may be a viable alternative to an explicit loop.

Depending on the size of the source tasks collection, this process can be time- and CPU-intensive and may need additional batch processing logic.
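One simple form of batch processing is to consume the source documents in fixed-size chunks, so that only one chunk is held in memory at a time. The generator below is a generic sketch, not part of the original migration scripts; it works with any iterable, including a database cursor.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items from iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    # islice pulls the next batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Each batch can then be transformed and written to the target collection before the next one is read.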
