How best to create an intake catalog from a collection of CSV files?


I'm trying to figure out the best way to create an intake catalog from a collection of CSV files, where I want each CSV file to be an individual source.

I can create a catalog.yml for one CSV by doing:

import intake
source1 = intake.open_csv('states_1.csv')
source1.name = 'states1'
with open('catalog.yml', 'w') as f:
    f.write(str(source1.yaml()))

which produces a valid catalog:

sources:
  states1:
    args:
      urlpath: states_1.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

But if I write two sources the same way:

import intake
source1 = intake.open_csv('states_1.csv')
source1.name = 'states1'
source2 = intake.open_csv('states_2.csv')
source2.name = 'states2'
with open('catalog.yml', 'w') as f:
    f.write(str(source1.yaml()))
    f.write(str(source2.yaml()))

the resulting catalog is invalid, because it contains a duplicate top-level sources key:

sources:
  states1:
    args:
      urlpath: states_1.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}
sources:
  states2:
    args:
      urlpath: states_2.csv
    description: ''
    driver: intake.source.csv.CSVSource
    metadata: {}

I'm guessing there must be a better way to go about this, like perhaps by instantiating a catalog object, adding source objects and then writing the catalog? But I couldn't find the methods to accomplish this.

What is the best practice for accomplishing this?


Solution

  • Try writing an empty catalog file, opening it with intake.open_catalog(), and adding your sources to it.

    import intake
    import yaml  # PyYAML, needed for yaml.dump below
    
    # Write an empty catalog skeleton to disk
    description = "Simple catalog for multiple CSV sources"
    skeleton = {'metadata': {'version': 1, 'description': description}, 'sources': {}}
    with open('catalog.yml', 'w') as f:
        yaml.dump(skeleton, f)
    
    # Open the (still empty) catalog as a catalog object
    catalog = intake.open_catalog('catalog.yml')
    
    # Define your CSV sources
    source1 = intake.open_csv('states_1.csv')
    source1.name = 'states1'
    source2 = intake.open_csv('states_2.csv')
    source2.name = 'states2'
    
    # Add the sources, keeping the catalog object that add() returns
    catalog = catalog.add(source1)
    catalog = catalog.add(source2)
    
    # Write the catalog, now holding both sources, back to disk
    catalog.save('catalog.yml')
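
For a larger collection of files, one alternative is to skip the per-source objects and build the whole catalog mapping in plain Python, so that there is only ever one sources key. The sketch below is a minimal illustration and does not use intake's own API: the entry layout is copied from the YAML shown in the question and the metadata block from the skeleton above, so it assumes an intake version that reads that layout. The helper names build_catalog and write_catalog are hypothetical, as is the rule of naming each source after its file stem, and write_catalog assumes PyYAML is installed.

```python
from pathlib import Path


def build_catalog(csv_paths, description=""):
    """Return a catalog mapping with one CSV source per file.

    Assumption: each entry mirrors the YAML printed by source.yaml()
    in the question (args/urlpath, description, driver, metadata).
    """
    sources = {}
    for path in map(Path, csv_paths):
        # Hypothetical naming rule: the file name without its extension.
        name = path.stem
        if name in sources:
            raise ValueError(f"duplicate source name: {name}")
        sources[name] = {
            "args": {"urlpath": path.as_posix()},
            "description": "",
            "driver": "intake.source.csv.CSVSource",
            "metadata": {},
        }
    return {
        "metadata": {"version": 1, "description": description},
        "sources": sources,
    }


def write_catalog(catalog, target="catalog.yml"):
    """Dump the mapping as YAML (assumes PyYAML is installed)."""
    import yaml  # imported here so build_catalog works without PyYAML

    with open(target, "w") as f:
        yaml.safe_dump(catalog, f, default_flow_style=False)
```

Calling write_catalog(build_catalog(sorted(glob.glob('states_*.csv')))) after import glob would then write a single catalog.yml, which can be opened with intake.open_catalog('catalog.yml') as in the answer above. Two files with the same stem in different directories raise a ValueError instead of silently overwriting each other.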