I am reading about "parameters" here and wondering whether I can define catalogue-level parameters that I can later use in the definitions of the catalogue's sources.
Consider a simple YAML catalogue with two sources:
sources:
  data1:
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data1.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
  data2:
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data2.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
Note that both data sources (data1 and data2) use the snapshot_date parameter inside the urlpath argument. With this definition I can load the data sources with:
cat = intake.open_catalog("./catalog.yaml")
cat.data1(snapshot_date="latest").read() # reads from data/latest/data1.csv
cat.data2(snapshot_date="20211029").read() # reads from data/20211029/data2.csv
Please note that cat.data1().read() will not work, since snapshot_date defaults to an empty string, so the CSV driver cannot find the path "./data//data1.csv".
I can set the default value by adding a parameters section to every (!) source, as shown below.
sources:
  data1:
    parameters:
      snapshot_date:
        type: str
        default: "latest"
        description: ""
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data1.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
  data2:
    parameters:
      snapshot_date:
        type: str
        default: "latest"
        description: ""
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{snapshot_date}}/data2.csv"
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
But this looks complicated (too much repetitive code) and is a little inconvenient for the end user: if a user wants to load all data sources from a given date, they have to explicitly provide the snapshot_date parameter to every (!) data source at initialization. IMO, it would be nice if the user could provide this value once when initializing the catalog.
Is there a way I can define the snapshot_date parameter at the catalog level, so that:
cat = intake.open_catalog("./catalog.yaml", snapshot_date="20211029")
cat.data1.read() # will return data from ./data/20211029/data1.csv
cat.data2.read() # will return data from ./data/20211029/data2.csv
cat.data2(snapshot_date="latest").read() # will return data from ./data/latest/data2.csv
cat = intake.open_catalog("./catalog.yaml")
cat.data1.read() # will return data from ./data/latest/data1.csv
cat.data2.read() # will return data from ./data/latest/data2.csv
Thanks in advance
This idea has been suggested before ( https://github.com/intake/intake/pull/562 , https://github.com/intake/intake/issues/511 ), and I have an inkling that maybe https://github.com/zillow/intake-nested-yaml-catalog supports something like what you are asking.
However, I fully support adding this functionality in Intake, either based on #562, above, or otherwise. Adding it to the base Catalog and YAML file(s) catalog should be easy, but doing it so that it works for all subclasses might be tricky.
Currently, you can achieve what you want using environment variables, e.g., by replacing "{{snapshot_date}}" with "{{env(SNAPSHOT_DATE)}}", but you would need to communicate to the user that this variable should be set. In addition, if the value is not to be used within a string, you would still need a parameter definition to cast it to the right type.
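To make the environment-variable workaround a bit more concrete, here is a rough sketch. The catalog text is the one from the question with only the urlpath templates changed. The snapshot_date context manager is a hypothetical convenience helper, not part of Intake, and the sketch assumes the template is expanded when the source is accessed, while the variable is still set.

```python
import os
from contextlib import contextmanager

# Catalog from the question, with {{snapshot_date}} swapped for an
# environment lookup. Save this text as catalog.yaml.
CATALOG_YAML = """\
sources:
  data1:
    driver: intake.source.csv.CSVSource
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{env(SNAPSHOT_DATE)}}/data1.csv"
  data2:
    driver: intake.source.csv.CSVSource
    args:
      urlpath: "{{CATALOG_DIR}}/data/{{env(SNAPSHOT_DATE)}}/data2.csv"
"""


@contextmanager
def snapshot_date(value, var="SNAPSHOT_DATE"):
    """Temporarily set the variable the catalog reads (hypothetical helper)."""
    previous = os.environ.get(var)
    os.environ[var] = value
    try:
        yield
    finally:
        # Restore the earlier state, including "not set at all".
        if previous is None:
            os.environ.pop(var, None)
        else:
            os.environ[var] = previous


# Intended usage (assumes intake is installed and catalog.yaml exists):
#
#     import intake
#     with snapshot_date("20211029"):
#         cat = intake.open_catalog("./catalog.yaml")
#         df1 = cat.data1.read()
#         df2 = cat.data2.read()
```

The user then sets the date once per block instead of once per source. There is still no per-source override, and no default unless you set the variable yourself before opening the catalog.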