yaml, pyarrow, apache-arrow, kedro

Does kedro data catalog accept .arrow files?


While using Kedro I want to load some data and work with it. To do that, one has to register the data in a conf/base/catalog.yml file. The Kedro Documentation of the Data Catalog explains how one can register data for Kedro to load. However, there is little to no information on how to load a .arrow file.

In the conf/base/catalog.yml I tried to register my data thus:

dataframe:
  type: arrow.ArrowDataSet
  filepath: "home/place/data.arrow"
  layer: primary

And of course I tried different combinations from the Data Catalog documentation mentioned above.
The error I get is the following:
DataSetError: An exception occurred when parsing config for DataSet 'dataframe': Class 'arrow.ArrowDataSet' not found or one of its dependencies has not been installed.

I have of course installed the arrow package in my environment.

Does the Kedro Data Catalog simply not accept .arrow files or is there a way to register such a format in the catalog.yml file?

Thanks in advance,

Jamal


Solution

  • As @0x26res said, you can use the Parquet dataset or any other dataset type that Kedro supports. Kedro can handle Parquet through the pyarrow engine: under the hood it calls pandas read_parquet, which supports two engines and uses pyarrow by default.

    It may be necessary to install dependencies to use other dataset types:

    pip install "kedro[pandas.ParquetDataSet]"
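
For illustration, a catalog entry using the Parquet dataset might look like the sketch below. It assumes the data has first been re-saved as a Parquet file; the filepath is hypothetical, and the entry name and layer are carried over from the question.

```yaml
# conf/base/catalog.yml
# Sketch only: assumes the data was converted to Parquet beforehand.
# The filepath below is a hypothetical location, not one from the question.
dataframe:
  type: pandas.ParquetDataSet
  filepath: data/03_primary/data.parquet
  layer: primary
```

With an entry like this, Kedro would load `dataframe` as a pandas DataFrame, reading the file through pandas read_parquet.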