Search code examples
pysparkhive

Create hive managed partitioned table through pyspark dataframe and append data for everyrun


I have a spark dataframe based on which I am trying to create a partitioned table in hive.

I have a flag to say if table exists or not. The first run should create the table and from second run onwards the data should be inserted into the table without overwriting existing data.

My question is how to create a partitioned table and insert into the already existing partitioned table without overriding existing data.

The table is partitioned by a column called date.

What I tried so far.(without partitions)

df.createOrReplaceTempView("df_view")
if table_exists:
   spark.sql("insert into mytable select * from df_view")
else:
   spark.sql("create table if not exists mytable as select * from df_view")

But I have to do the same with partitioned column - date.

There can be multiple runs for same date. So is it possible to append the data to same partition instead of overwriting it.

Expected output: After 1st run: table should be created with partition column as date.

Name date        timestamp
A.   2021-09-16  2021-09-16 12:00:01
B.   2021-09-16  2021-09-16 12:00:01

After second run on same date:(data should get appended to same partition)

Name date        timestamp
A.   2021-09-16  2021-09-16 12:00:01
B.   2021-09-16  2021-09-16 12:00:01
A.   2021-09-16  2021-09-16 12:20:01
B.   2021-09-16  2021-09-16 12:20:01

Third run on next date: (new partition should be created by retaining all existing data)

Name date        timestamp
A.   2021-09-16  2021-09-16 12:00:01
B.   2021-09-16  2021-09-16 12:00:01
A.   2021-09-16  2021-09-16 12:20:01
B.   2021-09-16  2021-09-16 12:20:01
A.   2021-09-17  2021-09-17 12:20:01
B.   2021-09-17  2021-09-17 12:20:01

How to achieve this in Pyspark.


Solution

  • Following the documentation your code might be something like this:

    df.write.saveAsTable('[table_name_here]', 
                         format='[format_here]', 
                         mode='append', 
                         partitionBy='date')
    

    This code there's no needed to check if table exists, append automatically creates if not exists.