Tags: apache-spark, pyspark, spark-streaming, delta-live-tables

Copying an incremental source table with Spark


A source table in a SQL database gains new rows every second.

I want to run some Spark code (maybe with Structured Streaming?) once per day to append the rows added since the last run; it's fine if the copy is up to one day stale. The copy would be a Delta table on Databricks.

I'm not sure spark.readStream will work, since the source table is not a Delta table but a JDBC (SQL) source.


Solution

  • Structured Streaming doesn't support a JDBC source: link

    If you have a strictly increasing column in your source table, you can instead read it in batch mode and store your progress in the userMetadata of the commits on your target Delta table (link), as in the sketch below.
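
A minimal PySpark sketch of that batch approach, under some assumptions: the source table name (`events`), its strictly increasing column (`id`), the JDBC URL, and the target path are all hypothetical, and the job assumes it is the only writer to the target table (otherwise the latest commit's userMetadata may not be yours).

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TARGET = "/mnt/delta/events_copy"                            # hypothetical path
JDBC_URL = "jdbc:sqlserver://host:1433;databaseName=mydb"    # hypothetical URL


def last_processed_id(spark, target_path):
    """Read the high-water mark stored in the userMetadata of the latest commit."""
    if not DeltaTable.isDeltaTable(spark, target_path):
        return 0  # first run: no target table yet, start from the beginning
    history = DeltaTable.forPath(spark, target_path).history(1).collect()
    meta = history[0]["userMetadata"] if history else None
    return int(meta) if meta else 0


low = last_processed_id(spark, TARGET)

# Batch-read only the new rows over JDBC; the subquery pushes the
# `id > low` filter down to the source database.
new_rows = (
    spark.read.format("jdbc")
    .option("url", JDBC_URL)
    .option("dbtable", f"(SELECT * FROM events WHERE id > {low}) AS src")
    .option("user", "...")        # credentials elided
    .option("password", "...")
    .load()
)

if new_rows.head(1):
    high = new_rows.agg({"id": "max"}).collect()[0][0]
    # Record the new high-water mark in the commit's userMetadata so the
    # next daily run can resume from it.
    (new_rows.write.format("delta")
        .mode("append")
        .option("userMetadata", str(high))
        .save(TARGET))
```

Because the filter is pushed down inside the `dbtable` subquery, each daily run only transfers the rows added since the previous run, rather than re-reading the whole source table.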