Tags: python, kedro

Convert CSV into Parquet in Kedro


I have a pretty big CSV file that won't fit into memory, and I need to convert it into a .parquet file to work with vaex.

Here is my catalog:

raw_data:
    type: kedro.contrib.io.pyspark.SparkDataSet
    filepath: data/01_raw/data.csv
    file_format: csv

parquet_data:
    type: ParquetLocalDataSet
    filepath: data/02_intermediate/data.parquet

node:

def convert_to_parquet(data: SparkDataSet) -> ParquetLocalDataSet:
    return data.coalesce(1)

and a pipeline:

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=convert_to_parquet,
                inputs="raw_data",
                outputs="parquet_data",
                name="data_to_parquet",
            ),
        ]
    )

But if I do kedro run I receive this error:

    kedro.io.core.DataSetError: Failed while saving data to data set
    ParquetLocalDataSet(engine=auto, filepath=data/02_intermediate/data.parquet,
    save_args={}). 'DataFrame' object has no attribute 'to_parquet'

What should I fix to get my dataset converted?


Solution

  • The error happens because ParquetLocalDataSet saves with pandas'
    DataFrame.to_parquet, but your node returns a pyspark DataFrame, which
    has no such method. Keep the output dataset in Spark as well by using a
    SparkDataSet with file_format: parquet — the following has worked for me
    in the past.

    parquet_data:
        type: kedro.contrib.io.pyspark.SparkDataSet
        file_format: 'parquet'
        filepath: data/02_intermediate/data.parquet
        save_args:
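
With both catalog entries defined as SparkDataSet, the node itself only needs to pass the loaded pyspark DataFrame through — Kedro hands the node the loaded data, not the dataset wrappers, and the catalog entry handles the actual Parquet write. A minimal sketch of the node (the type hints in your original signature referred to dataset classes, which is misleading; a plain function is enough):

```python
def convert_to_parquet(data):
    # `data` is the pyspark DataFrame loaded by the raw_data SparkDataSet.
    # coalesce(1) collapses the partitions so Spark writes a single Parquet
    # part file, which is convenient for opening with vaex afterwards.
    return data.coalesce(1)
```

Note that coalesce(1) forces everything through a single writer task, which can be slow for very large data; if vaex can read a directory of part files in your setup, you may prefer to drop the coalesce entirely.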