I am seeing a situation where, when I save a PySpark dataframe to a Hive table with multiple partition columns, it overwrites the data in the subpartitions too. Or maybe I am wrong to assume they are subpartitions.
I want to treat the column 'month' as a subpartition, so that I see 4 records in the Hive table (instead of 2) when I save df2 to the same table.
mode=append
will work. But if year and month are the same, I want the data to be overwritten. Is there a way to do this when saving a PySpark dataframe? (One possible approach is sketched after the example below.)
>>> df1 = spark.sql('select * from test_input')
>>> df1.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 01|
| c| d|2018| 01|
+---+---+----+-----+
>>> df1.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 01|
| c| d|2018| 01|
+---+---+----+-----+
>>> df2 = spark.sql('select * from test_input')
>>> df2.show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 02|
| c| d|2018| 02|
+---+---+----+-----+
>>> df2.write.saveAsTable('test_output',mode='overwrite',partitionBy=('year','month'))
>>> spark.sql('select * from test_output').show()
+---+---+----+-----+
| f1| f2|year|month|
+---+---+----+-----+
| a| b|2018| 02|
| c| d|2018| 02|
+---+---+----+-----+
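
For completeness, one possible approach (not part of the original post; it assumes Spark 2.3+, a Hive-enabled session, and a datasource table such as the one created by saveAsTable above) is dynamic partition overwrite combined with insertInto, which replaces only the partitions present in the incoming dataframe:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# With "dynamic" overwrite mode, an overwrite only replaces the (year, month)
# partitions that actually appear in the incoming data; other partitions are kept.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df2 = spark.sql('select * from test_input')

# insertInto writes by column position into the existing partitioned table,
# so the dataframe's columns must be in the table's order (partition columns last).
df2.write.insertInto('test_output', overwrite=True)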
It seems like you are misunderstanding the concept of partitioning.
This is not the window-function partitioning you would come across in a SQL statement; it instead refers to the way data is stored and referenced in memory or on a file system. Here's a helpful introduction.
Changing the partitioning of a Spark dataframe will never alter the number of rows in that dataframe.
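
As a minimal sketch (not part of the original answer; the output path below is made up for illustration) of how partitioning shows up as an on-disk layout rather than as anything that changes row counts:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('a', 'b', '2018', '01'), ('c', 'd', '2018', '02')],
    ['f1', 'f2', 'year', 'month'],
)

# partitionBy turns the partition columns into a directory hierarchy,
# one folder per distinct value.
df.write.partitionBy('year', 'month').parquet('/tmp/partition_demo')

# The rows now live under partition directories such as:
#   /tmp/partition_demo/year=2018/month=01/part-....parquet
#   /tmp/partition_demo/year=2018/month=02/part-....parquet
# The dataframe itself still has exactly the same rows; only the storage layout changed.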