I have one Delta table named "ali". I have created a read stream from that Delta table:
from pyspark.sql.functions import col
streamDF = spark.readStream.format('delta').load('dbfs:/user/hive/warehouse/pdf/ali')
display(streamDF)
Now I want to write my data stream into my silver Delta table:
(streamDF
  .writeStream
  .format("delta")
  .option("checkpointLocation", "dbfs:/user/hive/warehouse/pdf/ali")  # allows us to pick up where we left off if we lose connectivity
  .outputMode("append")  # appends data to our table
  .option("mergeSchema", "true")
  .start("dbfs:/tmp/hive/ali4"))
The Spark job stays in the running state continuously:
What should I do ?
Thanks
These are streaming jobs and will stay in a running state until you manually cancel them. The stream continuously reads data from the source Delta table (say, bronze) and writes it to the destination Delta table (say, silver).
In the cell output you will see a graph showing the input rate and the output rate.
This page has more info about the Structured Streaming UI in Spark: https://www.databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0.html
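As a minimal sketch of that "runs until you cancel it" behaviour: writeStream.start() returns a StreamingQuery handle that you can inspect and stop programmatically instead of cancelling the cell. The checkpoint directory below is a hypothetical dedicated path, not one taken from your code; the source/target paths are reused from the question.

query = (streamDF
  .writeStream
  .format("delta")
  .option("checkpointLocation", "dbfs:/tmp/hive/ali4_checkpoint")  # hypothetical dedicated checkpoint directory
  .outputMode("append")
  .start("dbfs:/tmp/hive/ali4"))

print(query.isActive)        # True while the stream is running
print(query.status)          # e.g. whether it is waiting for data or processing a batch
print(query.recentProgress)  # metrics for recent micro-batches (rows read/written)

# Stop the stream manually once you no longer need it:
# query.stop()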
If you want to validate that the stream is working, you can do the following (a sketch follows the list):
-> Open a new notebook (instead of adding a new cell below your write-stream code) and add a cell like this: spark.read.format("delta").load("silver location")
-> Ingest some sample data into your source Delta table (in your case, the bronze Delta location).
-> Re-run the cell you created in the first step, where you read the data from silver; you should see the newly ingested data flowing from the bronze (source) location into the silver (destination) table.
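A minimal sketch of that validation loop, assuming the bronze table sits at the question's source path, the silver output at dbfs:/tmp/hive/ali4, and a made-up two-column schema for the sample rows:

# Step 1 (in a new notebook/cell): read the silver (destination) location
spark.read.format("delta").load("dbfs:/tmp/hive/ali4").count()

# Step 2: ingest some sample data into the bronze (source) Delta table
sample_df = spark.createDataFrame([(1, "sample")], ["id", "value"])  # hypothetical schema
(sample_df.write
  .format("delta")
  .mode("append")
  .save("dbfs:/user/hive/warehouse/pdf/ali"))

# Step 3: re-run the silver read; the new rows should appear once the stream
# has processed the micro-batch that picked them up
spark.read.format("delta").load("dbfs:/tmp/hive/ali4").show()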