In Kedro, we can connect different nodes into a pipeline and run only some of those nodes. For a partial run to work, the outputs of earlier nodes must be saved somewhere, so that when a later node runs it can access the data the previous node generated. However, in which file do we write the code for this: pipeline.py, run.py or nodes.py?
For instance, I am trying to save a directory path directly to the DataCatalog under the name 'model_path'.
Snippet from pipeline.py:
from typing import Dict

from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node, decorators

# node functions are defined in nodes.py
from .nodes import split_files, create_and_train, validate_model


# A mapping from a pipeline name to a ``Pipeline`` object.
def create_pipelines(**kwargs) -> Dict[str, Pipeline]:
    io = DataCatalog(dict(
        model_path=MemoryDataSet()
    ))
    io.save('model_path', "data/06_models/model_test")
    print('****', io.exists('model_path'))

    pipeline = Pipeline([
        node(
            split_files,
            ["data_csv", "parameters"],
            ["train_filenames", "val_filenames", "train_labels", "val_labels"],
            name="splitting filenames"
        ),
        # node(
        #     create_and_train,
        #     ["train_filenames", "val_filenames", "train_labels", "val_labels", "parameters"],
        #     "model_path",
        #     name="Create Dataset, Train and Save Model"
        # ),
        node(
            validate_model,
            ["val_filenames", "val_labels", "model_path"],
            None,
            name="Validate Model",
        )
    ]).decorate(decorators.log_time, decorators.mem_profile)

    return {
        "__default__": pipeline
    }
However, I get the following error when I run kedro run:
ValueError: Pipeline input(s) {'model_path'} not found in the DataCatalog
Node inputs are automatically loaded by Kedro from the DataCatalog before being passed to the node function, and node outputs are subsequently saved to the DataCatalog after the node successfully produces data. By default, the DataCatalog configuration is taken from conf/base/catalog.yml.
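
As a quick illustration of that mapping (a hypothetical clean_data node, not from your project), the string given as inputs is looked up in the DataCatalog before the function is called, and the return value is saved under the outputs name afterwards:

from kedro.pipeline import node

def clean_data(df):
    # hypothetical example function
    return df.dropna()

node(
    clean_data,
    inputs="data_csv",    # loaded from the DataCatalog before clean_data runs
    outputs="clean_csv",  # return value is saved back to the DataCatalog
)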
In your example, model_path is produced by the Create Dataset, Train and Save Model node and then consumed by Validate Model. If the required dataset definition is not found in conf/base/catalog.yml, Kedro will try to store this dataset in memory using a MemoryDataSet. This will work if you run a pipeline that contains both the Create Dataset, Train and Save Model and Validate Model nodes (given no other issues arise). However, when you try to run the Validate Model node alone, Kedro attempts to read the model_path dataset from memory, where it does not exist.
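
For example, a partial run like the one below fails for exactly this reason, because no node in that run has produced model_path (the single-node flag was --node / -n at the time of writing; check kedro run --help for your version):

kedro run --node="Validate Model"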
So, TL;DR, to mitigate this you need to:

a) persist model_path by adding something like the following to your conf/base/catalog.yml:
model_path:
  type: TextLocalDataSet
  filepath: data/02_intermediate/model_path.txt
b) run the Create Dataset, Train and Save Model node (and its dependencies) at least once

After completing a) and b), you should be able to run Validate Model separately.
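
As a final sketch (assuming the create_and_train signature from your snippet), the node only needs to return the path string; with the catalog entry from a), Kedro then persists that string to model_path.txt instead of keeping it in memory:

def create_and_train(train_filenames, val_filenames, train_labels, val_labels, parameters):
    model_path = "data/06_models/model_test"
    # ... build the datasets, train the model and save it under model_path ...
    return model_path  # saved by Kedro via the model_path entry in catalog.yml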