Tags: hadoop, pyspark, hive, sas, parquet

Best method to save intermediate tables in pyspark


This is my first question on Stack Overflow.

I am replicating a SAS codebase in PySpark. The SAS codebase produces and stores scores of intermediate SAS datasets (100 when I last counted), which are used to cross-check the final output and also for other analyses at a later point in time.

My purpose is to save numerous PySpark DataFrames in some format so that they can be re-used in a separate PySpark session. I have thought of two options:

  1. Save the DataFrames as Hive tables.
  2. Save them as Parquet files.

Are there any other formats? Which method is faster? Will Parquet or CSV files have schema-related issues when re-reading them as PySpark DataFrames?
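For reference, this is roughly what I mean by the two options; the path, database, and table names below are just placeholders, not anything from the actual SAS project:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("save-intermediate-tables")
    .enableHiveSupport()   # required for option 1 (Hive tables)
    .getOrCreate()
)

df = spark.range(10)  # stand-in for any intermediate DataFrame

# Option 1: save as a Hive table (assumes the database already exists)
df.write.mode("overwrite").saveAsTable("analysis_db.intermediate_step1")

# Option 2: save as Parquet files at a known path
df.write.mode("overwrite").parquet("/data/intermediate/step1")

# In a separate PySpark session, either can be read back:
df_from_hive = spark.table("analysis_db.intermediate_step1")
df_from_parquet = spark.read.parquet("/data/intermediate/step1")
```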


Solution

  • The best option is to use Parquet files, as they have the following advantages:

    1. Good compression (often around 3x), which saves space.
    2. Columnar format, which allows column pruning and faster predicate pushdown.
    3. Well supported by Spark's Catalyst optimizer.
    4. The schema persists, because Parquet files store schema information.
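A minimal sketch of points 2 and 4; the column names and path are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2020-01-01", 0.75)],
    ["customer_id", "score_date", "score"],
)

df.write.mode("overwrite").parquet("/data/intermediate/scores")

# Re-reading in a later session recovers the same column names and types.
# With CSV you would instead have to infer or supply a schema yourself.
restored = spark.read.parquet("/data/intermediate/scores")
restored.printSchema()

# Column pruning and predicate pushdown: the plan shows that only the
# selected column and matching row groups need to be scanned.
restored.select("customer_id").filter(restored.score > 0.5).explain()
```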

    The only issue is to make sure you are not generating lots of small files. The default Parquet block size is 128 MB, so aim for output files that are roughly that large. You can repartition (or coalesce) the data before writing to keep the file sizes large enough, as in the sketch below.
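A sketch of compacting the output before writing; the target of 8 partitions is arbitrary and should be tuned so each file ends up near the 128 MB block size rather than producing many tiny files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Any intermediate DataFrame; this path is just an example.
df = spark.read.parquet("/data/intermediate/scores")

# Too many partitions means many small Parquet files. Coalesce (or
# repartition) down to a handful before writing so each file is large.
(
    df.coalesce(8)
      .write.mode("overwrite")
      .parquet("/data/intermediate/scores_compacted")
)
```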