python pip conda kedro

AttributeError: Object ParquetDataSet cannot be loaded from kedro.extras.datasets.pandas


I'm quite new to Kedro, and after installing it in my conda environment, I get the following error when trying to list my catalog:

Command performed: kedro catalog list

Error:

kedro.io.core.DataSetError: An exception occurred when parsing config for DataSet df_medinfo_raw: Object ParquetDataSet cannot be loaded from kedro.extras.datasets.pandas. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.pandas.ParquetDataSet:

I installed kedro through conda-forge: conda install -c conda-forge "kedro[pandas]". As far as I understand, installing kedro this way also installs the pandas dependencies.

I tried to read the kedro documentation for dependencies, but it's not really clear how to solve this kind of issue.

My kedro version is 0.17.6.


Solution

  • Kedro uses Pandas to load ParquetDataSet objects, and Pandas requires additional dependencies to accomplish this (see "Installation: Other data sources"). That is, in addition to Pandas, one must also install either fastparquet or pyarrow.

    For Conda, you want either:

    ## use pyarrow for parquet
    conda install -c conda-forge kedro pandas pyarrow
    

    or

    ## or use fastparquet for parquet
    conda install -c conda-forge kedro pandas fastparquet
    

    Note that the kedro[pandas] syntax used in the question is pip's "extras" notation and is meaningless to Conda (i.e., it ultimately parses to just kedro). Conda package specification uses a custom grammar called MatchSpec, in which anything inside [...] is parsed as [key1=value1;key2=value2;...]. Essentially, [pandas] is treated as an unknown key and is ignored.
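
To confirm which Parquet engine, if any, is present in the active environment, a minimal sketch like the one below may help. It only checks whether pyarrow or fastparquet is importable and does not touch Kedro itself. The helper name available_parquet_engines is hypothetical and not part of Kedro or pandas.

```python
from importlib.util import find_spec

# The two engines mentioned above that pandas can use for Parquet I/O.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate modules that are importable in this environment.

    Hypothetical helper for illustration only.
    """
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = available_parquet_engines()
    if found:
        print("Parquet engine(s) available:", ", ".join(found))
    else:
        print("No Parquet engine found; install pyarrow or fastparquet.")
```

Run it with the Python interpreter of the same conda environment in which kedro catalog list fails. If it reports that no engine is found, installing one of the two packages as shown above should be the next step.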