
Persisting only part of a data source


I'm using intake to access the catalog entry catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface. At the moment I only work with small patches of that data, but fetching them every single time is still quite costly (the data is on Google Cloud Storage). So I want to use intake's persist option to store the data locally. However, as far as I understand from the docs, one can only persist the whole dataset. For this dataset that would cost almost 400 dollars at $0.10 per GB, since the total size is 3976 GB.

Hence my questions:

  1. Is there a way (especially for a zarr store, which in theory should make this quite easy) to persist only part of the data, for instance only a subset of the variables?
  2. This is probably more complicated, but can I push things further and persist only the regions I'm interested in (selected by coordinate values, for instance)?

Solution

  • Intake has no direct way to do what you are asking. Intake was conceived as a way to get your data into a form that you can then manipulate as you normally would; that is, it deals only with the loading step, so a persisted dataset is identical to the original.

    However, it is not hard to do manually: load the source as an xarray dataset, select the variables and region you need, and call to_zarr to save the result as a new dataset. You can then point a simple catalogue entry, like the original one, at the new location.

    If this were a pattern you expected to repeat often, you could do the manipulation directly in a driver. In fact, we have mooted whether and how to implement such processing steps in Intake, but there is no plan yet. In the end, we may adopt the work on pipelines in HoloViews to describe processing steps.
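
    The manual route could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names select_subset and persist_subset are made up for this example, and the variable and coordinate names (surface_temp, surface_salt, xt_ocean, yt_ocean) are assumed rather than taken from the question, so check them against your dataset. A tiny synthetic dataset stands in for the remote one so the snippet runs offline.

```python
import numpy as np
import xarray as xr


def select_subset(ds, variables=None, region=None):
    """Return a subset of `ds` (hypothetical helper for this example).

    variables: iterable of data variable names to keep, or None for all.
    region:    dict mapping coordinate names to labels or slices,
               passed to Dataset.sel for label-based selection.
    """
    if variables is not None:
        ds = ds[list(variables)]
    if region:
        ds = ds.sel(region)
    return ds


def persist_subset(ds, target, variables=None, region=None):
    """Write the selected subset to a new zarr store at `target`."""
    subset = select_subset(ds, variables, region)
    subset.to_zarr(target, mode="w")  # mode="w" overwrites an existing store
    return subset


# Synthetic stand-in for the remote dataset; all names here are assumptions.
demo = xr.Dataset(
    {
        "surface_temp": (("yt_ocean", "xt_ocean"), np.zeros((4, 5))),
        "surface_salt": (("yt_ocean", "xt_ocean"), np.ones((4, 5))),
    },
    coords={
        "yt_ocean": [10.0, 20.0, 30.0, 40.0],
        "xt_ocean": [0.0, 1.0, 2.0, 3.0, 4.0],
    },
)

# Keep one variable and a small box in coordinate space.
small = select_subset(
    demo,
    variables=["surface_temp"],
    region={"yt_ocean": slice(20, 40), "xt_ocean": slice(1, 3)},
)

# With the real source, something along these lines should work
# (to_dask() is the usual way to get an xarray object from an
# intake-xarray source; `catalog` is the catalogue from the question):
#   ds = catalog.ocean.GFDL_CM2_6.GFDL_CM2_6_control_ocean_surface.to_dask()
#   persist_subset(ds, "cm26_subset.zarr", ["surface_temp"], {...})
#   local = xr.open_zarr("cm26_subset.zarr")
```

    Because the selection is applied before writing, only the chunks that overlap the requested variables and region should need to be downloaded. Depending on how the selection cuts across the source chunks, to_zarr may ask you to rechunk the subset first.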