Tags: pyspark, apache-spark-sql, hive-partitions

PySpark - data overwritten in Partition


I am seeing a situation where, when I save a PySpark dataframe to a Hive table partitioned by multiple columns, it overwrites the data in the subpartition too. Or maybe I am wrong to assume it is a subpartition.

I want to treat the column 'month' as a subpartition, so that I can see 4 records (in the Hive table) instead of 2 when I save df2 to the same table.

mode='append' will work. But if year and month are the same, I want the data to be overwritten. Is there a way to do this when saving a PySpark dataframe?

>>> df1 = spark.sql('select * from test_input')
>>> df1.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   01|
|  c|  d|2018|   01|
+---+---+----+-----+

>>> df1.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   01|
|  c|  d|2018|   01|
+---+---+----+-----+

>>> df2 = spark.sql('select * from test_input')
>>> df2.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   02|
|  c|  d|2018|   02|
+---+---+----+-----+

>>> df2.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
|  a|  b|2018|   02|
|  c|  d|2018|   02|
+---+---+----+-----+

Solution

  • It seems like you misunderstand the concept of partitioning.

    This is not the window-function partitioning that you would come across in a SQL statement; it instead refers to the way data is stored and referenced in memory or on a file system (see the sketch below). Here's a helpful introduction.

    Changing the partitioning of a Spark dataframe will never alter the number of rows in that dataframe.
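
    To make the file-system aspect concrete, here is a minimal sketch of what partitionBy actually writes to disk (it assumes a Hive-enabled Spark session; the directory path in the comments is illustrative):

    >>> df1 = spark.createDataFrame(
    ...     [('a', 'b', '2018', '01'), ('c', 'd', '2018', '01')],
    ...     ['f1', 'f2', 'year', 'month'])
    >>> # partitionBy stores each (year, month) combination in its own directory
    >>> # under the table location, e.g.
    >>> #   .../test_output/year=2018/month=01/part-...
    >>> df1.write.saveAsTable('test_output', mode='overwrite', partitionBy=('year', 'month'))
    >>> # A second saveAsTable with mode='overwrite' replaces the whole table
    >>> # (its definition and all of its data), so writing only month=02 rows
    >>> # afterwards removes the month=01 directory instead of adding a sibling
    >>> # directory next to it.

    In other words, the four-record result you expect would require the write to touch only the partitions present in df2, which is not what mode='overwrite' does with saveAsTable.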