
Saving data with DataCatalog


I was looking at the iris project example provided by Kedro. Apart from logging the accuracy, I also wanted to save the predictions and test_y as a CSV.

This is the example node provided by Kedro:

import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)

I added the following to save the data.

from kedro.extras.datasets.pandas import CSVDataSet

data = pd.DataFrame({"target": target, "prediction": predictions})
data_set = CSVDataSet(filepath="data/test.csv")
data_set.save(data)

This works as intended; however, my question is: is this the Kedro way of doing things? Can I define the dataset in catalog.yml and save data to it later? If so, how do I access that dataset from catalog.yml inside a node?

Is there a way to save data without constructing the dataset inside the node, as in data_set = CSVDataSet(filepath="data/test.csv")? I would like to define it in catalog.yml, if possible and if that follows Kedro conventions.


Solution

  • Kedro actually abstracts this part for you. You don't need to access the datasets via their Python API.

    Your report_accuracy method does need to be tweaked to return the DataFrame instead of None.
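
    For example, here is a minimal sketch of that tweak, reusing the DataFrame you already build in your workaround (essentially only the return type and the final line change):

    import logging

    import numpy as np
    import pandas as pd


    def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
        """Node for reporting the accuracy of the predictions and returning
        them so that Kedro can save them via the catalog.
        """
        target = np.argmax(test_y.to_numpy(), axis=1)
        accuracy = np.sum(predictions == target) / target.shape[0]
        log = logging.getLogger(__name__)
        log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
        # Returning the DataFrame lets Kedro persist it through whatever
        # dataset is mapped to this node's output in the catalog.
        return pd.DataFrame({"target": target, "prediction": predictions})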

    Your node then needs to be defined as follows:

    node(
      func=report_accuracy,
      inputs='dataset_a',
      outputs='dataset_b'
    )
    

    Kedro then looks at your catalog and will load/save dataset_a and dataset_b as required:

    dataset_a:
      type: pandas.CSVDataSet
      filepath: xxxx.csv

    dataset_b:
      type: pandas.ParquetDataSet
      filepath: yyyy.pq
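
    Since report_accuracy takes two arguments, its inputs would be given as a list. As a rough sketch, here is how the node might be registered in a pipeline (create_pipeline is the usual Kedro project convention; the dataset names are placeholders that must match entries in your catalog or the outputs of upstream nodes):

    from kedro.pipeline import Pipeline, node

    def create_pipeline(**kwargs) -> Pipeline:
        return Pipeline(
            [
                node(
                    func=report_accuracy,
                    # Placeholder names; match them to catalog.yml or to
                    # the outputs of upstream nodes.
                    inputs=["example_predictions", "example_test_y"],
                    outputs="dataset_b",
                )
            ]
        )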
    

    As you run the node/pipeline, Kedro will handle the load and save operations for you. You also don't need to persist every dataset: if a dataset is only used mid-way through a pipeline and isn't declared in catalog.yml, Kedro keeps it in memory as a MemoryDataSet and discards it when the run finishes; see the sketch below and the Kedro documentation on MemoryDataSet.
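
    As an illustration (the node functions preprocess and train here are hypothetical placeholders), any dataset name that does not appear in catalog.yml is backed by a MemoryDataSet:

    from kedro.pipeline import Pipeline, node

    # Hypothetical placeholder node functions, for illustration only.
    def preprocess(df):
        return df

    def train(df):
        return df

    # "intermediate" is not declared in catalog.yml, so Kedro holds it in a
    # MemoryDataSet for the duration of the run and never writes it to disk.
    # "dataset_a" and "dataset_b" are declared in the catalog, so they are
    # loaded from / saved to their configured files.
    pipeline = Pipeline(
        [
            node(func=preprocess, inputs="dataset_a", outputs="intermediate"),
            node(func=train, inputs="intermediate", outputs="dataset_b"),
        ]
    )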