scala hadoop amazon-s3 apache-spark parquet

Convert csv.gz files into Parquet using Spark

I need to implement converting csv.gz files in a folder, both in AWS S3 and HDFS, to Parquet files using Spark (Scala preferred). One of the columns of the data is a timestamp and I only have a week of dataset. The timestamp format is:

'yyyy-MM-dd hh:mm:ss'

The output that I desire is that for every day, there is a folder (or partition) where the Parquet files for that specific date is located. So there would 7 output folders or partitions.

I only have a faint idea of how to do this, only sc.textFile is on my mind. Is there a function in Spark that can convert to Parquet? How do I implement this in S3 and HDFS?

Thanks for you help.

Solution

If you look into the Spark Dataframe API, and the Spark-CSV package, this will achieve the majority of what you're trying to do - reading in the CSV file into a dataframe, then writing the dataframe out as parquet will get you most of the way there.

You'll still need to do some steps on parsing the timestamp and using the results to partition the data.