Tags: pyspark, databricks, delta-lake, databricks-autoloader

pyspark - will the partitionBy option in Auto Loader's writeStream partition existing table data?


I used Auto Loader to read data files and write them to a table periodically (without partitioning at first) using the code below:

# df is the streaming DataFrame from the Auto Loader read (sketched below)
df.writeStream \
    .option("checkpointLocation", "path") \
    .format("delta") \
    .outputMode("append") \
    .start("table")
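
(For context, the Auto Loader read side that produces df typically looks like the sketch below; the input path, file format, and schema location are illustrative assumptions, not part of the original question.)

# Hypothetical Auto Loader read side: path, format, and schema location are assumptions
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "schema_path") \
    .load("input_path")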

Now the data size is growing, and I want to partition the data by adding the option .partitionBy("col1"):

df.writeStream \
    .option("checkpointLocation", "path") \
    .partitionBy("col1") \
    .format("delta") \
    .outputMode("append") \
    .start("table")

Will this .partitionBy("col1") option partition the existing data in the table? If not, how do I partition all the data (both the existing data and newly ingested data)?


Solution

  • No, it won't partition existing data automatically; you will need to do it explicitly. Something like the following works; test it first on a small dataset:

    • Stop the stream if it is running continuously
    • Read the existing data and overwrite it with the new partitioning (the overwriteSchema option is needed because changing the partition columns is treated as a schema change in Delta; see the verification sketch after this list):
    spark.read.table("table") \
      .write.mode("overwrite") \
      .partitionBy("col1") \
      .option("overwriteSchema", "true") \
      .saveAsTable("table")
    
    • Start the stream again, keeping .partitionBy("col1") in the writeStream so newly ingested data follows the same layout
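
To confirm the rewrite took effect, you can inspect the table's partition columns; DESCRIBE DETAIL is standard Delta Lake SQL, and "table" below is the question's placeholder name:

    # Check that partitionColumns now lists col1 ("table" is the question's placeholder)
    spark.sql("DESCRIBE DETAIL table").select("partitionColumns").show(truncate=False)

The restarted stream can reuse the existing checkpoint: Auto Loader's checkpoint tracks which source files have already been ingested, so rewriting the table's layout should not cause old files to be re-processed.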