I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:
```yaml
my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "s3://my_bucket/data/04_feature/my_big_dataset.hdf5"
```
I want to refer to these objects in the catalog by their S3 URI so that my team can use them. HOWEVER, I want to avoid re-downloading the datasets, model weights, etc. every time I run a pipeline by keeping a local copy in addition to the S3 copy. How do I mirror files with Kedro?
This is a good question. Kedro has `CachedDataSet` for caching datasets within a single run: it keeps the wrapped dataset in memory once it has been loaded, so repeated loads in the same run don't hit the underlying storage again. There isn't an equivalent that persists across runs; in general, Kedro doesn't keep much state between runs.
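For reference, wrapping a dataset in `CachedDataSet` in the catalog looks roughly like the following (the dataset name and filepath here are made up for illustration, and the exact module path for the type can vary by Kedro version):

```yaml
my_cached_dataset:
  type: kedro.io.CachedDataSet
  dataset:
    type: kedro.extras.datasets.pandas.CSVDataSet
    filepath: "s3://my_bucket/data/01_raw/my_cached_dataset.csv"
```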
That said, off the top of my head, I can think of two options that (mostly) replicate or provide this functionality:
Option 1: a catalog in the same config environment, but using the `TemplatedConfigLoader`, where your catalog datasets have filepaths that look something like:

```yaml
my_dataset:
  filepath: ${base_data}/01_raw/blah.csv
```

You then set `base_data` to `s3://bucket/blah` when running in "production" mode and to `local_filepath/data` when running locally. How exactly you do that is up to you in your overridden context method, whether it's via `local/globals.yml` (see the linked documentation above), environment variables, or whatever else; a sketch of that setup follows below.
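To make the templating concrete, here is a minimal sketch, assuming a Kedro version (roughly 0.16/0.17) where you can override `_create_config_loader` on your project context; the exact hook and import paths vary by version, and `base_data` is just the hypothetical key used in the catalog snippet above:

```python
# src/<your_package>/run.py (or wherever your ProjectContext lives) -- a sketch
from kedro.config import TemplatedConfigLoader
from kedro.framework.context import KedroContext


class ProjectContext(KedroContext):
    def _create_config_loader(self, conf_paths):
        # Pick up conf/base/globals.yml and conf/local/globals.yml; values found
        # there (e.g. base_data) are substituted into ${...} placeholders in the
        # catalog before the datasets are instantiated.
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")
```

with the two globals files (shown together here, but living in separate environments) along the lines of:

```yaml
# conf/base/globals.yml -- used in "production" mode
base_data: s3://my_bucket/data
```

```yaml
# conf/local/globals.yml -- takes precedence when running locally
base_data: data
```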
Option 2: a separate config environment, e.g. `local` (it's kind of what it was made for!), where you keep a separate copy of your catalog in which the filepaths are replaced with local ones, along the lines of the override sketched below.
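For example, a hypothetical `conf/local/catalog.yml` mirroring the entry from the question could shadow the S3 filepath with a local one (entries in the run environment take precedence over `conf/base`):

```yaml
# conf/local/catalog.yml -- overrides the entry of the same name in conf/base
my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "data/04_feature/my_big_dataset.hdf5"
```

Otherwise, your next best bet is to write a `PersistentCachedDataSet`,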
similar to `CachedDataSet`, which intercepts loading and saving for the wrapped dataset, makes a local copy the first time it loads, and stores it in a deterministic location that it looks up on subsequent loads.
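No such dataset ships with Kedro, so the name `PersistentCachedDataSet` and everything below are hypothetical; this is just a minimal sketch of the idea, wrapping an arbitrary dataset and keeping a pickle of the loaded object on local disk (rather than a byte-for-byte mirror of the remote file, purely to keep the sketch short):

```python
# A hypothetical PersistentCachedDataSet -- not part of Kedro, just a sketch of the idea.
from pathlib import Path
from typing import Any, Dict

from kedro.io import AbstractDataSet
from kedro.extras.datasets.pickle import PickleDataSet


class PersistentCachedDataSet(AbstractDataSet):
    """Wraps another dataset and keeps a local pickle copy of whatever it loads,
    so subsequent runs read from local disk instead of the remote store."""

    def __init__(self, dataset: Dict[str, Any], cache_dir: str, cache_name: str):
        # Build the wrapped dataset from its nested catalog definition.
        self._dataset = AbstractDataSet.from_config(cache_name, dataset)
        # Deterministic local location derived from the catalog entry's name.
        self._cache = PickleDataSet(filepath=str(Path(cache_dir) / f"{cache_name}.pkl"))

    def _load(self) -> Any:
        if self._cache.exists():
            return self._cache.load()   # cache hit: skip the remote load
        data = self._dataset.load()     # cache miss: load from the wrapped dataset
        self._cache.save(data)          # ...and persist a local copy
        return data

    def _save(self, data: Any) -> None:
        self._dataset.save(data)        # always save to the "real" location
        self._cache.save(data)          # keep the local mirror in sync

    def _describe(self) -> Dict[str, Any]:
        return {"dataset": self._dataset._describe(), "cache": self._cache._describe()}
```

You could then point a catalog entry of type `<your_package>.<module>.PersistentCachedDataSet` at the same nested `dataset` definition as in the question, plus `cache_dir`/`cache_name` arguments, much like the `CachedDataSet` example earlier.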