Tags: pandas, apache-spark, parquet, snappy

How to append multiple parquet files to one dataframe in Pandas


I am working on decompressing snappy.parquet files with Spark and Pandas. I have 180 files (about 7 GB of data) in my Jupyter notebook. In my understanding, I need to create a loop to grab all the files, decompress them with Spark, and append them to a Pandas table? Here is the code:

import findspark
findspark.init()  # locate the local Spark installation before importing pyspark

import pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Read a single snappy-compressed Parquet file into a Spark dataframe
parquetFile = spark.read.parquet("file_name.snappy.parquet")

# Register the dataframe as a temporary view so it can be queried with SQL
parquetFile.createOrReplaceTempView("parquetFile")
file_output = spark.sql("SELECT * FROM parquetFile")
file_output.show()

# Convert the Spark dataframe to a Pandas dataframe
pandas_df = file_output.toPandas()

This part works and I have my Pandas dataframe from one file, but there are another 180 files that I need to append to pandas_df. Can anyone help me out? Thank you!


Solution

  • With Spark you can load a dataframe from a single file or from multiple files; you only need to replace the path to your single file with the path to your folder (assuming all 180 of your files are in the same directory), as in the sketch below.

    parquetFile = spark.read.parquet("your_dir_path/")
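
    All 180 files are then loaded into one Spark dataframe, which you can convert to Pandas the same way as in your code, so no append loop is needed. A minimal sketch, assuming the files share the same schema, the combined 7 GB fits into the driver's memory, and "your_dir_path/" is just a placeholder for your folder:

    import findspark
    findspark.init()

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    # Point spark.read.parquet at the folder instead of a single file;
    # every .snappy.parquet file inside is read and decompressed by Spark.
    parquetFile = spark.read.parquet("your_dir_path/")

    # One toPandas() call on the combined dataframe replaces the append loop.
    pandas_df = parquetFile.toPandas()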