Kedro - Getting path to item in the datacatalog

I'm training an NLP model using spaCy. I have all the preprocessing steps written as a pipeline, and now I need to do the training. According to spaCy's documentation, I need to run the following command:

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

The files config.cfg, train.spacy and dev.spacy are all registered in my data catalog. I want to run this command with something similar to the following code:

import subprocess

def train_spacy_nlp_model(
    config_filepath: str,
    train_filepath: str,
    dev_filepath: str,
    output_dir: str,
):
    # Assemble the spaCy CLI call from the paths passed to the node
    cmd = [
        "python -m", "spacy",
        "train", config_filepath,
        "--output", output_dir,
        "--paths.train", train_filepath,
        "--paths.dev", dev_filepath,
    ]

    result = subprocess.run(" ".join(cmd), shell=True)
    if result.returncode != 0:
        raise RuntimeError("Spacy training failed")

However, I have no idea how to retrieve the file paths of the items in my data catalog. Is there a way to pass this information to my nodes when creating the pipeline?


  • This is probably not the most elegant solution, but it works for me, so I'll use it until I find a better one. The solution is to return the path together with the object from my DataSet implementation. I doubt this would generalize to other datasets, such as SQL queries, but since I know I'm dealing with a file here, it works fine. Here is my implementation:

    from kedro.io import AbstractDataSet
    from spacy.tokens import DocBin
    from dataclasses import dataclass
    from typing import Union
    from pathlib import Path

    @dataclass
    class DocBinModel:
        # Bundles the loaded DocBin with the path it was read from
        filepath: Path
        docbin: DocBin

    class SpacyDocBinDataSet(AbstractDataSet):
        def __init__(self, filepath, save_args=None, load_args=None):
            self._filepath = filepath
            self._save_args = save_args or {}
            self._load_args = load_args or {}

        def _describe(self):
            return dict(
                filepath=self._filepath,
                save_args=self._save_args,
                load_args=self._load_args,
            )

        def _load(self):
            with open(self._filepath, "rb") as f:
                docbin = DocBin().from_bytes(f.read())
            # Return the path alongside the data so nodes can use it
            return DocBinModel(self._filepath, docbin)

        def _save(self, data: Union[DocBin, DocBinModel]):
            if isinstance(data, DocBinModel):
                data = data.docbin
            with open(self._filepath, "wb") as f:
                f.write(data.to_bytes())

        def _exists(self):
            return Path(self._filepath).exists()
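To connect this back to the original question, here is a minimal sketch of how a node might turn the loaded objects into the spaCy training command. It is only an illustration under a few assumptions: `build_spacy_train_command` is a hypothetical helper name, the `DocBinModel` below is a simplified stand-in (its `docbin` field is typed loosely so the snippet runs without spaCy installed), and `config_filepath` and `output_dir` are treated as plain strings, leaving open how they reach the node.

```python
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any, List


@dataclass
class DocBinModel:
    # Simplified stand-in for the DocBinModel in the answer above;
    # `docbin` is untyped here so the sketch does not need spaCy.
    filepath: Path
    docbin: Any = None


def build_spacy_train_command(
    config_filepath: str,
    train: DocBinModel,
    dev: DocBinModel,
    output_dir: str,
) -> List[str]:
    """Assemble the `spacy train` call from the paths carried by the loaded objects."""
    return [
        sys.executable, "-m", "spacy",  # run spaCy with the current interpreter
        "train", str(config_filepath),
        "--output", str(output_dir),
        "--paths.train", str(train.filepath),
        "--paths.dev", str(dev.filepath),
    ]


# Example call with hypothetical file names
cmd = build_spacy_train_command(
    "config.cfg",
    DocBinModel(Path("train.spacy")),
    DocBinModel(Path("dev.spacy")),
    "output",
)
print(cmd[1:])
```

Inside the node, the resulting list could be handed to `subprocess.run(cmd, check=True)`, which raises `CalledProcessError` on a non-zero exit code. Passing the arguments as a list also avoids `shell=True` and the quoting problems that come with paths containing spaces.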