Tags: apache-spark, pyspark, spark-streaming, delta-live-tables

Copying an incremental source table with Spark


A source table in a SQL database gains new rows every second.

I want to run some Spark code (maybe with Structured Streaming?) once per day to append the rows added since the last run; it's fine if the copy is up to one day stale. The copy would be a Delta table on Databricks.

I'm not sure spark.readStream will work, since the source table is not a Delta table but a JDBC (SQL) source.


Solution

  • Structured Streaming doesn't support a JDBC source: link

    If you have a strictly increasing column in your source table, you can instead read it in batch mode and store your progress in the userMetadata of the commits on your target Delta table (link), as in the sketch below.
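
A minimal PySpark sketch of that batch approach, under some assumptions: the source table name (`events`), its strictly increasing column (`id`), the JDBC URL, and the target path are all hypothetical, and the job assumes it is the only writer to the target table (otherwise the latest commit's userMetadata may not be yours).

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TARGET = "/mnt/delta/events_copy"                            # hypothetical path
JDBC_URL = "jdbc:sqlserver://host:1433;databaseName=mydb"    # hypothetical URL


def last_processed_id(spark, target_path):
    """Read the high-water mark stored in the userMetadata of the latest commit."""
    if not DeltaTable.isDeltaTable(spark, target_path):
        return 0  # first run: no target table yet, start from the beginning
    history = DeltaTable.forPath(spark, target_path).history(1).collect()
    meta = history[0]["userMetadata"] if history else None
    return int(meta) if meta else 0


low = last_processed_id(spark, TARGET)

# Batch-read only the new rows over JDBC; the subquery pushes the
# `id > low` filter down to the source database.
new_rows = (
    spark.read.format("jdbc")
    .option("url", JDBC_URL)
    .option("dbtable", f"(SELECT * FROM events WHERE id > {low}) AS src")
    .option("user", "...")        # credentials elided
    .option("password", "...")
    .load()
)

if new_rows.head(1):
    high = new_rows.agg({"id": "max"}).collect()[0][0]
    # Record the new high-water mark in the commit's userMetadata so the
    # next daily run can resume from it.
    (new_rows.write.format("delta")
        .mode("append")
        .option("userMetadata", str(high))
        .save(TARGET))
```

Because the filter is pushed down inside the `dbtable` subquery, each daily run only transfers the rows added since the previous run, rather than re-reading the whole source table.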