Tags: python, tensorflow, kedro

Where to perform the saving of a node output in Kedro?


In Kedro, we can pipeline different nodes together and run only some of them. When a pipeline is run partially, the data produced by the nodes that do run needs to be saved somewhere, so that a node run later can access what a previous node generated. However, in which file do we write the code for this - pipeline.py, run.py or nodes.py?

For instance, I am trying to save a directory path directly to the DataCatalog under the variable name 'model_path'.

Snippet from pipeline.py:

    from typing import Dict

    from kedro.io import DataCatalog, MemoryDataSet
    from kedro.pipeline import Pipeline, node
    from kedro.pipeline import decorators

    # split_files, create_and_train and validate_model are imported from the
    # project's nodes.py (import omitted here)

    def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
        """Create a mapping from a pipeline name to a ``Pipeline`` object."""
        io = DataCatalog(dict(
            model_path=MemoryDataSet()
        ))

        io.save('model_path', "data/06_models/model_test")
        print('****', io.exists('model_path'))

        pipeline = Pipeline([
            node(
                split_files,
                ["data_csv", "parameters"],
                ["train_filenames", "val_filenames", "train_labels", "val_labels"],
                name="splitting filenames"
            ),
            # node(
            #     create_and_train,
            #     ["train_filenames", "val_filenames", "train_labels", "val_labels", "parameters"],
            #     "model_path",
            #     name="Create Dataset, Train and Save Model"
            # ),
            node(
                validate_model,
                ["val_filenames", "val_labels", "model_path"],
                None,
                name="Validate Model",
            )
        ]).decorate(decorators.log_time, decorators.mem_profile)

        return {
            "__default__": pipeline
        }

However, I get the following error when I run kedro run:

ValueError: Pipeline input(s) {'model_path'} not found in the DataCatalog

Solution

  • Node inputs are automatically loaded by Kedro from the DataCatalog before being passed to the node function. Node outputs are then saved to the DataCatalog after the node successfully produces the data. By default, the DataCatalog configuration is taken from conf/base/catalog.yml.

    In your example, model_path is produced by the Create Dataset, Train and Save Model node and then consumed by Validate Model. If the required dataset definition is not found in conf/base/catalog.yml, Kedro will try to store that dataset in memory using MemoryDataSet. This works if you run a pipeline that contains both the Create Dataset... and Validate Model nodes (given no other issues arise). However, when you try to run the Validate Model node alone, Kedro attempts to read the model_path dataset from memory, where it doesn't exist.

    So, TLDR:

    To mitigate this, you need to:

    a) persist model_path by adding something like the following to your conf/base/catalog.yml:

    model_path:
      type: TextLocalDataSet
      filepath: data/02_intermediate/model_path.txt
    

    b) run the Create Dataset, Train and Save Model node (and its dependencies) at least once

    After completing a) and b) you should be able to run Validate Model separately. The sketch below illustrates the difference that persisting the dataset makes.
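
    As a minimal sketch (not part of the original answer), the snippet below contrasts the MemoryDataSet fallback with a persisted dataset, using the same kedro.io API as the question. The TextLocalDataSet type and file paths simply mirror the catalog.yml entry from step a); they are example values, not required names.

        from kedro.io import DataCatalog, TextLocalDataSet

        # Catalog with model_path persisted to disk, equivalent to the
        # catalog.yml entry shown above.
        io = DataCatalog(dict(
            model_path=TextLocalDataSet(filepath="data/02_intermediate/model_path.txt")
        ))

        # What a run of the training node effectively does: its output is written to disk...
        io.save("model_path", "data/06_models/model_test")

        # ...so a later, separate run of the Validate Model node can still load it.
        # With the default MemoryDataSet fallback, the value would be gone in a fresh run.
        print(io.load("model_path"))  # -> data/06_models/model_test

    Once the catalog entry from a) is in place and the training node has run once, running Validate Model on its own should find model_path on disk instead of failing because the dataset cannot be found.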