This is my first question on Stack Overflow.
I am replicating a SAS codebase in PySpark. The SAS codebase produces and stores scores of intermediate SAS datasets (100 when I last counted), which are used to cross-check the final output and also for other analyses at a later point in time.
My purpose is to save numerous PySpark dataframes in some format so that they can be reused in a separate PySpark session. I have thought of 2 options:
Are there any other formats? Which method is faster? Will Parquet files or CSV files have schema-related issues when re-reading the files as PySpark dataframes?
The best option is to use Parquet files, as they have the following advantages:
The only issue is to make sure you are not generating many small files. The default Parquet block size is 128 MB, so make sure your files are sufficiently large. You can repartition the data before writing to make sure each file is large enough.
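A minimal sketch of how the repartition-then-write step could look. The helper names (`target_partitions`, `save_intermediate`, `load_intermediate`) and the `total_bytes` estimate are hypothetical and not part of PySpark; you would have to supply your own size estimate for each dataframe. The only PySpark calls used are `DataFrame.repartition`, `DataFrame.write.mode(...).parquet(...)` and `spark.read.parquet(...)`.

```python
import math

# Assumed target of roughly 128 MB per output file, matching the
# default Parquet block size mentioned above.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_partitions(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Hypothetical helper: number of output files for a given data size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def save_intermediate(df, path, total_bytes):
    """Write a PySpark dataframe as Parquet with a bounded number of files.

    total_bytes is the caller's own estimate of the dataframe size.
    """
    n = target_partitions(total_bytes)
    # repartition(n) shuffles the data into n partitions, so the write
    # produces about n files instead of many small ones.
    df.repartition(n).write.mode("overwrite").parquet(path)


def load_intermediate(spark, path):
    """Read the dataframe back in a separate session.

    Parquet files store the schema, so no schema has to be passed here.
    """
    return spark.read.parquet(path)
```

Because Parquet stores column names and types alongside the data, reading the files back gives the same schema that was written. With CSV you would have to infer the schema or specify it again on read.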