
How to process huge datasets in kedro


I have a pretty big (~200 GB, ~20M lines) raw JSONL dataset. I need to extract the important properties from it and store the intermediate dataset as CSV for further conversion into something like HDF5, Parquet, etc. Obviously I can't use JSONDataSet for loading the raw dataset, because it uses pandas.read_json under the hood, and using pandas for a dataset of that size sounds like a bad idea. So I'm thinking about reading the raw dataset line by line, processing it, and appending the processed data line by line to the intermediate dataset.

What I can't understand is how to make this compatible with AbstractDataSet and its _load and _save methods.
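For reference, this is roughly the shape I imagine (a minimal sketch, assuming a hypothetical JSONLinesDataSet subclassing kedro.io.AbstractDataSet; the file-streaming logic is just a placeholder):

    from typing import Any, Dict, Iterator

    from kedro.io import AbstractDataSet


    class JSONLinesDataSet(AbstractDataSet):
        """Hypothetical dataset that streams a JSONL file line by line."""

        def __init__(self, filepath: str):
            self._filepath = filepath

        def _load(self) -> Iterator[str]:
            # Return a generator so the whole file never has to fit in memory.
            def lines():
                with open(self._filepath, encoding="utf-8") as f:
                    for line in f:
                        yield line
            return lines()

        def _save(self, data: Iterator[str]) -> None:
            # Write already-processed lines one by one.
            with open(self._filepath, "w", encoding="utf-8") as f:
                for line in data:
                    f.write(line)

        def _describe(self) -> Dict[str, Any]:
            return dict(filepath=self._filepath)

The idea would be that _load yields lines lazily, so a node can process and append them one at a time.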

P.S. I understand I can move this out of Kedro's context and introduce the preprocessed dataset as a raw one, but that kinda breaks the whole idea of complete pipelines.


Solution

  • Try using PySpark to leverage lazy evaluation and batch execution. SparkDataSet is implemented in kedro.contrib.io.pyspark.spark_data_set; a sketch of a node consuming it follows the catalog config below.

    Sample catalog config for jsonl:

    your_dataset_name:
      type: kedro.contrib.io.pyspark.SparkDataSet
      filepath: "file_path"
      file_format: json
      load_args:
        multiline: True  # use False if every record sits on its own line (standard JSONL)
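
    The node then receives a lazily evaluated Spark DataFrame, so it can select the properties you need and hand them to another SparkDataSet (e.g. with file_format: parquet) without ever materialising the full 200 GB in memory. A rough sketch (the column names are placeholders for whatever properties you need):

        from pyspark.sql import DataFrame


        def extract_important_properties(raw: DataFrame) -> DataFrame:
            # `raw` is the DataFrame loaded by SparkDataSet; nothing is read
            # until Spark actually has to write the output.
            return raw.select("id", "timestamp", "payload.value")

    Wire it up as a regular node, e.g. node(extract_important_properties, "your_dataset_name", "intermediate_dataset"), and register intermediate_dataset as another SparkDataSet so Spark handles the batched write.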