
How to process huge datasets in kedro


I have a pretty big (~200 GB, ~20M lines) raw JSONL dataset. I need to extract the important properties from it and store the intermediate dataset as CSV for further conversion into something like HDF5, Parquet, etc. Obviously I can't use JSONDataSet for loading the raw dataset, because it uses pandas.read_json under the hood, and using pandas for a dataset of that size sounds like a bad idea. So I'm thinking about reading the raw dataset line by line, processing it, and appending the processed data line by line to the intermediate dataset.

What I can't understand is how to make this compatible with AbstractDataSet and its _load and _save methods.
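For reference, this is roughly the shape I imagine (a minimal sketch, assuming a hypothetical JSONLinesDataSet subclassing kedro.io.AbstractDataSet; the file-streaming logic is just a placeholder):

    from typing import Any, Dict, Iterator

    from kedro.io import AbstractDataSet


    class JSONLinesDataSet(AbstractDataSet):
        """Hypothetical dataset that streams a JSONL file line by line."""

        def __init__(self, filepath: str):
            self._filepath = filepath

        def _load(self) -> Iterator[str]:
            # Return a generator so the whole file never has to fit in memory.
            def lines():
                with open(self._filepath, encoding="utf-8") as f:
                    for line in f:
                        yield line
            return lines()

        def _save(self, data: Iterator[str]) -> None:
            # Write already-processed lines one by one.
            with open(self._filepath, "w", encoding="utf-8") as f:
                for line in data:
                    f.write(line)

        def _describe(self) -> Dict[str, Any]:
            return dict(filepath=self._filepath)

The idea would be that _load yields lines lazily, so a node can process and append them one at a time.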

P.S. I understand I can move this out of Kedro's context and introduce the preprocessed dataset as a raw one, but that kinda breaks the whole idea of complete pipelines.


Solution

  • Try using PySpark to leverage lazy evaluation and batch execution. SparkDataSet is implemented in kedro.contrib.io.pyspark.spark_data_set; a sketch of a node consuming it follows the catalog config below.

    Sample catalog config for jsonl:

    your_dataset_name:
      type: kedro.contrib.io.pyspark.SparkDataSet
      filepath: "file_path"
      file_format: json
      load_args:
        multiline: True  # use False if every record sits on its own line (standard JSONL)
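
    The node then receives a lazily evaluated Spark DataFrame, so it can select the properties you need and hand them to another SparkDataSet (e.g. with file_format: parquet) without ever materialising the full 200 GB in memory. A rough sketch (the column names are placeholders for whatever properties you need):

        from pyspark.sql import DataFrame


        def extract_important_properties(raw: DataFrame) -> DataFrame:
            # `raw` is the DataFrame loaded by SparkDataSet; nothing is read
            # until Spark actually has to write the output.
            return raw.select("id", "timestamp", "payload.value")

    Wire it up as a regular node, e.g. node(extract_important_properties, "your_dataset_name", "intermediate_dataset"), and register intermediate_dataset as another SparkDataSet so Spark handles the batched write.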