
Saving data with DataCatalog


I was looking at the iris project example provided by Kedro. Apart from logging the accuracy, I also wanted to save the predictions and test_y as a CSV.

This is the example node provided by Kedro:

import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)

I added the following to save the data.

from kedro.extras.datasets.pandas import CSVDataSet

data = pd.DataFrame({"target": target, "prediction": predictions})
data_set = CSVDataSet(filepath="data/test.csv")
data_set.save(data)

This works as intended; however, my question is: is this the Kedro way of doing things? Can I define the dataset in catalog.yml and save data to it later? If so, how do I access that dataset from catalog.yml inside a node?

Is there a way to save data without constructing the dataset inside the node, as in data_set = CSVDataSet(filepath="data/test.csv")? I would like to define it in catalog.yml, if possible and if that follows Kedro conventions.


Solution

  • Kedro actually abstracts this part for you. You don't need to access the datasets via their Python API.

    Your report_accuracy method does need to be tweaked to return the DataFrame instead of None.
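
    For example, here is a minimal sketch of that tweak, reusing the DataFrame you already build in your workaround (essentially only the return type and the final line change):

    import logging

    import numpy as np
    import pandas as pd


    def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
        """Node for reporting the accuracy of the predictions and returning
        them so that Kedro can save them via the catalog.
        """
        target = np.argmax(test_y.to_numpy(), axis=1)
        accuracy = np.sum(predictions == target) / target.shape[0]
        log = logging.getLogger(__name__)
        log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
        # Returning the DataFrame lets Kedro persist it through whatever
        # dataset is mapped to this node's output in the catalog.
        return pd.DataFrame({"target": target, "prediction": predictions})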

    Your node then needs to be defined as follows:

    node(
      func=report_accuracy,
      inputs='dataset_a',
      outputs='dataset_b'
    )
    

    Kedro then looks at your catalog and will load/save dataset_a and dataset_b as required:

    dataset_a:
      type: pandas.CSVDataSet
      filepath: xxxx.csv

    dataset_b:
      type: pandas.ParquetDataSet
      filepath: yyyy.pq
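
    Since report_accuracy takes two arguments, its inputs would be given as a list. As a rough sketch, here is how the node might be registered in a pipeline (create_pipeline is the usual Kedro project convention; the dataset names are placeholders that must match entries in your catalog or the outputs of upstream nodes):

    from kedro.pipeline import Pipeline, node

    def create_pipeline(**kwargs) -> Pipeline:
        return Pipeline(
            [
                node(
                    func=report_accuracy,
                    # Placeholder names; match them to catalog.yml or to
                    # the outputs of upstream nodes.
                    inputs=["example_predictions", "example_test_y"],
                    outputs="dataset_b",
                )
            ]
        )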
    

    As you run the node/pipeline, Kedro will handle the load and save operations for you. You also don't need to persist every dataset: if a dataset is only used mid-way through a pipeline and isn't declared in catalog.yml, Kedro keeps it in memory as a MemoryDataSet and discards it when the run finishes; see the sketch below and the Kedro documentation on MemoryDataSet.
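
    As an illustration (the node functions preprocess and train here are hypothetical placeholders), any dataset name that does not appear in catalog.yml is backed by a MemoryDataSet:

    from kedro.pipeline import Pipeline, node

    # Hypothetical placeholder node functions, for illustration only.
    def preprocess(df):
        return df

    def train(df):
        return df

    # "intermediate" is not declared in catalog.yml, so Kedro holds it in a
    # MemoryDataSet for the duration of the run and never writes it to disk.
    # "dataset_a" and "dataset_b" are declared in the catalog, so they are
    # loaded from / saved to their configured files.
    pipeline = Pipeline(
        [
            node(func=preprocess, inputs="dataset_a", outputs="intermediate"),
            node(func=train, inputs="intermediate", outputs="dataset_b"),
        ]
    )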