tensorflow keras pyspark parquet

How can I open a large parquet file with Keras?


I've tried searching for this and haven't found any meaningful results.

I have a multi-input model, and my data was getting too large for my pandas approach, so I preprocessed it and saved it to a parquet file. I'm not sure how to open it with Keras.

I looked up tf.datasets but I still cannot figure out how to read a parquet file that I can pass to my model.

Does anyone know how to open parquet files? I can't seem to figure out how to do this in TensorFlow and can't find anything related to it in Keras.


Solution

  • You can probably keep your pandas approach, but you would have to break your data down into chunks.

    If you already broke the data down to create your parquet file, you should be able to use the same method to open only a subset of your data in pandas at a time.

    If you need to extract the data from your parquet file, here's a link on how to create chunks of data for a pandas dataframe: How to read a CSV file subset by subset with Pandas? The link covers CSV files, but the same chunk-by-chunk idea applies to parquet.

    Once you have a chunk of data, you can call model.fit on it, then move on to the next chunk and call model.fit again.