Search code examples
amazon-s3cachingdevopskedro

How to catalog datasets & models by S3 URI, but keep a local copy?


I'm trying to figure out how to store intermediate Kedro pipeline objects both locally AND on S3. In particular, say I have a dataset on S3:

my_big_dataset.hdf5:
  type: kedro.extras.datasets.pandas.HDFDataSet
  filepath: "s3://my_bucket/data/04_feature/my_big_dataset.hdf5"

I want to refer to these objects in the catalog by their S3 URI so that my team can use them. HOWEVER, I want to avoid re-downloading the datasets, model weights, etc. every time I run a pipeline by keeping a local copy in addition to the S3 copy. How do I mirror files with Kedro?


Solution

  • This is a good question, Kedro has CachedDataSet for caching datasets within the same run, which handles caching the dataset in memory when it's used/loaded multiple times in the same run. There isn't really the same thing that persists across runs, in general Kedro doesn't do much persistent stuff.

    That said, off the top of my head, I can think of two options that (mostly) replicates or gives this functionality:

    1. Use the same catalog in the same config environment but with the TemplatedConfigLoader where your catalog datasets have their filepaths looking something like:
    my_dataset:
      filepath: ${base_data}/01_raw/blah.csv
    

    and you set base_data to s3://bucket/blah when running in "production" mode and with local_filepath/data locally. You can decide how exactly you do this in your overriden context method (whether it's using local/globals.yml (see the linked documentation above) or environment variables or what not.

    1. Use separate environments, likely local (it's kind of what it was made for!) where you keep a separate copy of your catalog where the filepaths are replaced with local ones.

    Otherwise, your next best bet is to write a PersistentCachedDataSet similar to CachedDataSet which intercepts the loading/saving for the wrapped dataset and makes a local copy when loading for the first time in a deterministic location that you look up on subsequent loads.