Task Collection Migrations
The schema of the task documents generated during calculation parsing has changed over time as the underlying workflows and workflow management software have evolved. Below are the migrations that were required to move the Materials Project's 10+ year-old core task collection from the `TaskDocument` produced by atomate's `VaspDrone`, to emmet-core's `TaskDoc` (used by atomate2), and finally to the minified emmet-core `CoreTaskDoc` (the schema of documents returned by the `tasks` endpoint of the `mp-api` client).
Server-Side Migrations
A few simple field renames and type coercions were necessary during the transition from `TaskDocument` to `TaskDoc`, none of which require client-side data manipulation. These operations can be safely applied before migrating to `TaskDoc` to make the process smoother, and can also be applied to an existing `TaskDoc` collection.
Migration of Potcar Symbols for `orig_inputs`

During the process of updating the `CalculationInput` class in emmet-core, a number of inconsistencies were found in the `potcar` field of `orig_inputs` (ref: emmet PR comment). These updates will address those inconsistencies:
```python
from pymongo import UpdateMany

ops = [
    # Migrate any erroneously parsed emmet.core.vasp.calculation.PotcarSpec's
    # to the correct location
    UpdateMany(
        {
            "$and": [
                {"orig_inputs.potcar": {"$type": "array"}},
                {"orig_inputs.potcar.0": {"$type": "object"}},
            ]
        },
        {
            "$rename": {"orig_inputs.potcar": "orig_inputs.potcar_spec"},
        },
    ),
    # Migrate emmet.core.Potcar (deprecated) struct -> list of potcar symbols.
    # An aggregation pipeline update is used so the field can be set from the
    # value of another field.
    UpdateMany(
        {"orig_inputs.potcar": {"$type": "object"}},
        [{"$set": {"orig_inputs.potcar": "$orig_inputs.potcar.symbols"}}],
    ),
]
```

Migration of `has_vasp_completed`

The transition from atomate to atomate2 changed the `has_vasp_completed` field from a boolean to an enumeration. These operations will correct any mis-typed fields:
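A minimal sketch of such operations, assuming the field lives on each `calcs_reversed` entry and that the enumeration serializes to the strings "successful" and "failed" (as in `emmet.core.vasp.calculation.TaskState`; verify both assumptions against your emmet-core version):

```python
from pymongo import UpdateMany

ops = [
    # Map legacy boolean values onto the assumed TaskState enum strings.
    UpdateMany(
        {"calcs_reversed.has_vasp_completed": {"$type": "bool"}},
        {"$set": {"calcs_reversed.$[c].has_vasp_completed": "successful"}},
        array_filters=[{"c.has_vasp_completed": True}],
    ),
    UpdateMany(
        {"calcs_reversed.has_vasp_completed": {"$type": "bool"}},
        {"$set": {"calcs_reversed.$[c].has_vasp_completed": "failed"}},
        array_filters=[{"c.has_vasp_completed": False}],
    ),
]
```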
Removal of dropped `input` and `orig_inputs` fields

A number of fields were dropped in favor of turning these values into `@property`s on the underlying `emmet.core.vasp.Calculation` class (emmet-core#1226). These fields can be safely removed:
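A minimal sketch; the field names below are placeholders, as the authoritative list of dropped fields is in emmet-core#1226:

```python
from pymongo import UpdateMany

# Placeholder names for illustration only; substitute the fields actually
# dropped in emmet-core#1226.
dropped_fields = [
    "input.example_dropped_field",
    "orig_inputs.example_dropped_field",
]

ops = [UpdateMany({}, {"$unset": {field: "" for field in dropped_fields}})]
```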
The preceding operations can be executed individually, or concatenated and executed all at once. Consult the MongoDB bulk write documentation for examples of executing bulk writes.
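For example, a minimal sketch assuming the operations above have been concatenated into a single `ops` list and that the collection is named `tasks` in a database named `mp_core` (both placeholders):

```python
from pymongo import MongoClient

# Connection URI, database, and collection names are placeholders.
client = MongoClient("mongodb://localhost:27017")
tasks = client["mp_core"]["tasks"]

# Operations execute in order by default (ordered=True), which matters when a
# later operation depends on an earlier rename.
result = tasks.bulk_write(ops)
print(result.bulk_api_result)
```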
Client-Side Migrations
The following migrations involve more complicated manipulations and can have long run times depending on the size of the source tasks collection.
Migration of `TaskDocument` to `TaskDoc`

The following `create_new_taskdoc` function can be used to transform a `TaskDocument` into a document with the `TaskDoc` schema. Coordination of database operations in this situation is highly dependent on the execution environment and is thus left to the user.
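A minimal sketch of such a function, assuming the server-side operations above have already been applied and leaning on `TaskDoc`'s pydantic validators to coerce legacy fields (the real transformation may require more bespoke handling):

```python
from emmet.core.tasks import TaskDoc

def create_new_taskdoc(task_document: dict) -> dict:
    """Transform a legacy TaskDocument dict into a TaskDoc-shaped dict.

    Sketch only: fields that TaskDoc's validators cannot coerce will raise
    a ValidationError and need bespoke handling.
    """
    task_document.pop("_id", None)  # drop the Mongo ObjectId before validation
    return TaskDoc(**task_document).model_dump()
```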
Migration of `TaskDoc` to `CoreTaskDoc`

This is a multi-step process whose steps should be performed in order.
Step 1: Flattening `calcs_reversed`

Prior to atomate2, workflows would output multiple calculations into a single directory, which led to the need for a field like `calcs_reversed` that allowed multiple calculations parsed from a single directory to be stored in a single task document. atomate2 has done away with this; now, even for complex, multi-step workflows, each individual calculation has its own output directory. Extracting the individual calculations from `calcs_reversed` is straightforward:
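A minimal client-side sketch, assuming a `tasks` source collection and a `flattened_tasks` target collection (both names are placeholders):

```python
from pymongo import MongoClient

# Connection URI, database, and collection names are placeholders.
client = MongoClient("mongodb://localhost:27017")
source_collection = client["mp_core"]["tasks"]
target_collection = client["mp_core"]["flattened_tasks"]

batch = []
for doc in source_collection.find({"calcs_reversed.1": {"$exists": True}}):
    # Emit one document per entry in calcs_reversed.
    for calc in doc["calcs_reversed"]:
        new_doc = {**doc, "calcs_reversed": calc}
        new_doc.pop("_id", None)  # let MongoDB assign fresh _ids
        batch.append(new_doc)
    if len(batch) >= 1000:
        target_collection.insert_many(batch)
        batch = []
if batch:
    target_collection.insert_many(batch)
```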
If the `source_collection` -> `target_collection` pattern is followed, also be sure to copy all documents from `source_collection` with `len(calcs_reversed) == 1`, setting the `calcs_reversed` field equal to `calcs_reversed[0]`.
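Continuing the sketch above, those single-calculation documents can be carried over like so:

```python
# Carry over single-calculation documents, unwrapping the one-element list.
for doc in source_collection.find({"calcs_reversed": {"$size": 1}}):
    doc.pop("_id", None)
    doc["calcs_reversed"] = doc["calcs_reversed"][0]
    target_collection.insert_one(doc)
```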
"Parsing" extracted entries from
calcs_reversedThe
calcs_reversedentries from step 1 will need to be transformed into their ownTaskDocdocuments:
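A sketch of this transformation, reusing `create_new_taskdoc` from above and reading from the `flattened_tasks` collection of the previous step; re-wrapping the flattened entry into a one-element list is an assumption about the shape `TaskDoc` expects:

```python
def parse_flattened_doc(doc: dict) -> dict:
    # Re-wrap the flattened entry so TaskDoc's validators accept it.
    doc = {**doc, "calcs_reversed": [doc["calcs_reversed"]]}
    return create_new_taskdoc(doc)

for doc in target_collection.find({}):
    parsed = parse_flattened_doc(doc)
    # ... write `parsed` to the destination of your choice
```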
Flattened "
TaskDoc" toCoreTaskDocAnother core difference in
CoreTaskDocis the removal of the calculation's "trajectory" (energies/forces tracked across the ionic steps in the calculation: ionic_steps) from the database entries to be stored externally. The size of theionic_stepsfield can vary drastically across calculations and was found to be a major contributor to the on-disk storage size of MP's core task collection. See a full discussion here: emmet PR #1232.This snippet uses pyarrow to store the trajectories as parquet files as part of a pyarrow Dataset. Alternative storage formats can be substituted in as well
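A minimal sketch of the idea, assuming the trajectory lives at `calcs_reversed.output.ionic_steps` in the flattened documents and writing a hive-partitioned parquet dataset keyed by `task_id` (the output path and schema choices are placeholders):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Output directory is a placeholder.
TRAJECTORY_DIR = "trajectories"

def externalize_trajectory(doc: dict) -> dict:
    """Strip ionic_steps from a flattened task document, writing them to parquet."""
    ionic_steps = doc["calcs_reversed"]["output"].pop("ionic_steps", [])
    if ionic_steps:
        table = pa.Table.from_pylist(
            [
                {"task_id": doc["task_id"], "step_index": i, **step}
                for i, step in enumerate(ionic_steps)
            ]
        )
        # Each task lands in its own task_id=<...> partition directory.
        ds.write_dataset(
            table,
            base_dir=TRAJECTORY_DIR,
            format="parquet",
            partitioning=ds.partitioning(
                pa.schema([("task_id", pa.string())]), flavor="hive"
            ),
            existing_data_behavior="overwrite_or_ignore",
        )
    return doc
```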
Depending on the size of the source tasks collection, this process can be time- and CPU-intensive and may require additional batch-processing logic.