Tags: pyspark, databricks, delta-lake, databricks-autoloader

pyspark - will the partitionBy option in Auto Loader's writeStream partition existing table data?


I used Auto Loader to read data files and write them to a table periodically (without partitioning at first) using the code below:

# df is the streaming DataFrame from the Auto Loader read (sketched below)
df.writeStream \
    .option("checkpointLocation", "path") \
    .format("delta") \
    .outputMode("append") \
    .start("table")
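
(For context, the Auto Loader read side that produces df typically looks like the sketch below; the input path, file format, and schema location are illustrative assumptions, not part of the original question.)

# Hypothetical Auto Loader read side: path, format, and schema location are assumptions
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "schema_path") \
    .load("input_path")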

Now the data size is growing, and I want to partition the data by adding the option .partitionBy("col1"):

df.writeStream \
    .option("checkpointLocation", "path") \
    .partitionBy("col1") \
    .format("delta") \
    .outputMode("append") \
    .start("table")

Will this .partitionBy("col1") option partition the existing data in the table? If not, how do I partition all the data (both the existing data and newly ingested data)?


Solution

  • No, it won't partition existing data automatically; you will need to do it explicitly. Something like the following works; test it first on a small dataset:

    • Stop the stream if it is running continuously
    • Read the existing data and overwrite it with the new partitioning (the overwriteSchema option is needed because changing the partition columns is treated as a schema change in Delta; see the verification sketch after this list):
    spark.read.table("table") \
      .write.mode("overwrite") \
      .partitionBy("col1") \
      .option("overwriteSchema", "true") \
      .saveAsTable("table")
    
    • Start the stream again, keeping .partitionBy("col1") in the writeStream so newly ingested data follows the same layout
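
To confirm the rewrite took effect, you can inspect the table's partition columns; DESCRIBE DETAIL is standard Delta Lake SQL, and "table" below is the question's placeholder name:

    # Check that partitionColumns now lists col1 ("table" is the question's placeholder)
    spark.sql("DESCRIBE DETAIL table").select("partitionColumns").show(truncate=False)

The restarted stream can reuse the existing checkpoint: Auto Loader's checkpoint tracks which source files have already been ingested, so rewriting the table's layout should not cause old files to be re-processed.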