I have hundreds of CSV files that I want to process similarly. For simplicity, we can assume that they are all in ./data/01_raw/ (like ./data/01_raw/1.csv, ./data/01_raw/2.csv, etc.). I would much rather not give each file a different name and keep track of them individually when building my pipeline. Is there any way to read all of them in bulk by specifying something in the catalog.yml file?
You are looking for PartitionedDataSet. In your example, the catalog.yml might look like this:
my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw"
  dataset: "pandas.CSVDataSet"
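Per Kedro's documented behavior, a node that takes this dataset as an input receives a dictionary mapping each partition id (the file's path relative to `path`, without the extension handling configured) to a zero-argument load function, so partitions are loaded lazily. A minimal sketch of such a node (the function name `concat_partitions` is my own; I collect loaded partitions into a list here to stay library-agnostic, but you could just as well `pd.concat` them):

```python
from typing import Any, Callable, Dict, List


def concat_partitions(partitioned_input: Dict[str, Callable[[], Any]]) -> List[Any]:
    """Load every partition of a PartitionedDataSet and return them in key order.

    Each value in ``partitioned_input`` is a callable that loads one
    underlying CSV (via pandas.CSVDataSet) only when invoked.
    """
    results = []
    for partition_id, load_func in sorted(partitioned_input.items()):
        # The CSV is read from disk only at this point, not at pipeline start.
        results.append(load_func())
    return results
```

Register `concat_partitions` as a node with `my_partitioned_dataset` as its input, and it will see every CSV under data/01_raw without you naming them individually.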