
Intake: catalogue level parameters


I have been reading about "parameters" in Intake and am wondering whether I can define catalogue-level parameters that I can later use in the definitions of the catalogue's sources.

Consider a simple YAML-catalogue with two sources:

sources:
  data1:
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data1.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
    
  data2:
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data2.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

Note that both data sources (data1 and data2) use the snapshot_date parameter inside the urlpath argument. With this definition I can load the data sources with:

cat = intake.open_catalog("./catalog.yaml")
cat.data1(snapshot_date="latest").read()   # reads from data/latest/data1.csv
cat.data2(snapshot_date="20211029").read() # reads from data/20211029/data2.csv

Please note that cat.data1().read() will not work, since snapshot_date defaults to an empty string, so the CSV driver cannot find the path "./data//data1.csv".

I can set the default value by adding a parameters section to every (!) source, as shown below:

sources:
  data1:
    parameters:
      snapshot_date:
        type: str
        default: "latest"
        description: ""
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data1.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
    
  data2:
    parameters:
      snapshot_date:
        type: str
        default: "latest"
        description: ""
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data2.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

But this looks complicated (too much repetitive code) and is a little inconvenient for the end user: if a user wants to load all data sources for a given date, they have to explicitly provide the snapshot_date parameter to every (!) data source at initialization. IMO, it would be nice if the user could provide this value once, when initializing the catalog.

Is there a way I can define the snapshot_date parameter at the catalog level, so that:

  • I can set a default value (e.g. "latest" in my example) in the YAML definition of the catalogue's parameter,
  • or I can pass the catalogue's parameter value at runtime, in the call intake.open_catalog("./catalog.yaml", snapshot_date="20211029"),
  • and this value is accessible in the definitions of the data sources of this catalog?

For example:

cat = intake.open_catalog("./catalog.yaml", snapshot_date="20211029")
cat.data1.read()  # will return data from ./data/20211029/data1.csv
cat.data2.read()  # will return data from ./data/20211029/data2.csv
cat.data2(snapshot_date="latest").read()  # will return data from ./data/latest/data2.csv

cat = intake.open_catalog("./catalog.yaml")
cat.data1.read()  # will return data from ./data/latest/data1.csv
cat.data2.read()  # will return data from ./data/latest/data2.csv

Thanks in advance


Solution

  • This idea has been suggested before (https://github.com/intake/intake/pull/562, https://github.com/intake/intake/issues/511), and I have an inkling that https://github.com/zillow/intake-nested-yaml-catalog may support something like what you are asking for.

    However, I fully support adding this functionality in Intake, either based on #562, above, or otherwise. Adding it to the base Catalog and YAML file(s) catalog should be easy, but doing it so that it works for all subclasses might be tricky.

    Currently, you can achieve what you want using environment variables, e.g., "{{snapshot_date}}"->"{{env(SNAPSHOT_DATE)}}", but you would need to communicate to the user that this variable should be set. In addition, if the value is not to be used within a string, you would still need a parameter definition to cast to the right type.
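
To make the environment-variable workaround a little less error-prone for users, the variable could be set through a small helper. The following is only a sketch: the name snapshot_env is hypothetical (it is not part of Intake), and it assumes the catalog's urlpath entries have been changed to use "{{env(SNAPSHOT_DATE)}}". When exactly Intake expands the template may depend on the version, so the commented usage keeps both opening the catalog and reading the sources inside the block.

```python
import os
from contextlib import contextmanager


@contextmanager
def snapshot_env(snapshot_date, var="SNAPSHOT_DATE"):
    """Temporarily set the environment variable that the catalog reads.

    Hypothetical helper, not part of Intake.
    """
    previous = os.environ.get(var)
    os.environ[var] = snapshot_date
    try:
        yield
    finally:
        # Restore the earlier state so other code is unaffected.
        if previous is None:
            os.environ.pop(var, None)
        else:
            os.environ[var] = previous


# Assumed usage, with catalog.yaml containing e.g.
#   urlpath: "{{CATALOG_DIR}}/data/{{env(SNAPSHOT_DATE)}}/data1.csv"
#
# import intake
# with snapshot_env("20211029"):
#     cat = intake.open_catalog("./catalog.yaml")
#     df1 = cat.data1.read()
#     df2 = cat.data2.read()
```

Unlike a parameter with a default, this gives no "latest" fallback by itself: if the variable is not set, the path would presumably end up with an empty segment, as in the "./data//data1.csv" case from the question, so the user still has to set it one way or another.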