Tags: apache-spark, databricks, spark-structured-streaming, delta-lake

How reliable is a Spark stream join with a static Databricks Delta table?


In Databricks there is a cool feature that allows you to join a streaming DataFrame with a Delta table. The cool part is that changes in the Delta table are still reflected in subsequent join results. It works just fine, but I'm curious how this works, and what the limitations are. E.g., what's the expected update delay? How does it change as the Delta table grows? Is it safe to rely on it in production?
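
For reference, the pattern in question is a stream-static join, which looks roughly like this in PySpark (the paths and the `user_id` join key are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Streaming side: a stream of events read from a (hypothetical) Delta path.
events = spark.readStream.format("delta").load("/data/events")

# Static side: a plain batch read of a Delta table.
users = spark.read.format("delta").load("/data/users")

# Stream-static join: each micro-batch of `events` is joined against the
# snapshot of `users` that is current when the batch runs, which is why
# later updates to the table show up in later join results.
enriched = events.join(users, on="user_id", how="left")

query = (enriched.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/enriched")
         .outputMode("append")
         .start("/data/enriched"))
```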


Solution

  • Yes, you can rely on this feature (it's really a Spark feature, not something Databricks-specific) - many customers are using it in production. Regarding the other questions, there are multiple aspects here, depending on factors such as how often the table is updated:

    • Because the static Delta table isn't cached, it's re-read on each join. Depending on the cluster configuration this may not be too bad if you use Delta caching, so files aren't re-downloaded every time; only new data is downloaded.
    • Read performance could also be affected if the table has a lot of small files - it depends on how you write into that table and whether you run things like OPTIMIZE.
    • Depending on how often the Delta table is updated, you can cache it and periodically refresh it (see the sketch after this list).
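
    A rough sketch of that "cache and periodically refresh" idea, using foreachBatch (the path, join key, and refresh interval below are assumptions for illustration; running OPTIMIZE on the static table to compact small files is a separate, complementary step):

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

REFRESH_INTERVAL_SEC = 300  # how stale the static side may get (assumption)

state = {"df": None, "loaded_at": 0.0}

def get_users():
    # Re-read and re-cache the Delta table when the cached copy is too old;
    # otherwise reuse the cached snapshot so micro-batches don't re-read it.
    now = time.time()
    if state["df"] is None or now - state["loaded_at"] > REFRESH_INTERVAL_SEC:
        if state["df"] is not None:
            state["df"].unpersist()
        state["df"] = spark.read.format("delta").load("/data/users").cache()
        state["loaded_at"] = now
    return state["df"]

def process_batch(batch_df, batch_id):
    # foreachBatch hands us a plain batch DataFrame, so we can join it
    # against the (possibly refreshed) cached static side.
    enriched = batch_df.join(get_users(), on="user_id", how="left")
    enriched.write.format("delta").mode("append").save("/data/enriched")

events = spark.readStream.format("delta").load("/data/events")
query = (events.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "/chk/enriched")
         .start())
```

    The trade-off is freshness: with an explicit cache you give up the automatic pick-up of table changes and instead control the update delay via the refresh interval.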

    But really, to answer this completely, you would need to provide more information specific to your code, use case, etc.