apache-spark · spark-structured-streaming · spark-streaming-kafka

Spark Structured Streaming - Streaming data joined with static data which will be refreshed every 5 mins


For a Spark Structured Streaming job, one input comes from a Kafka topic while the second is a file (refreshed every 5 minutes by a Python API). I need to join these two inputs and write the result to a Kafka topic.

The issue I am facing: when the second input file is being refreshed while the Spark streaming job is reading it, I get the error below:

File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.

Any help will be appreciated.
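One way to shrink the window in which Spark can see a half-written file is to have the refresh job write the new data to a temporary file in the same directory and then atomically rename it over the old one, so any reader sees either the complete old version or the complete new one. This is a minimal sketch of that pattern in plain Python (the function name is illustrative); note it does not fully solve the problem for multi-part Spark output directories, where part-file names change between refreshes:

```python
import csv
import os
import tempfile

def write_csv_atomically(rows, dest_path):
    """Write rows to a temp file in the same directory, then atomically
    rename it over dest_path. Readers never observe a partially
    written file: they see either the old contents or the new ones."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX
    except BaseException:
        os.remove(tmp_path)  # clean up the temp file on failure
        raise
```

The same idea extends to a whole directory of part files: write the new snapshot into a fresh directory and repoint a symlink, rather than rewriting files in place.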


Solution

  • Use HBase as your store for the static data. It is more work, for sure, but it allows concurrent updating.

    Where I work, all Spark Streaming jobs use HBase for data lookup, and it is far faster. What if you have 100M customers but a micro-batch of only 10k records? I know it is a lot of work initially.

    See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc

    If you have a small static reference table, then a stream-static join is fine; but since yours is also being updated, that causes the issue you are seeing.
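The per-micro-batch lookup the answer describes can be sketched without Spark or HBase: for each batch, collect the distinct join keys and issue one batched get against the store, instead of scanning a reference table that may hold 100M rows. Here a plain dict stands in for the HBase table and all names are illustrative; in a real job this would run per partition, with the batched get being a single HBase multi-row fetch:

```python
def enrich_batch(batch_records, store, key_field="customer_id"):
    """Join a micro-batch against a key-value store by issuing one
    batched lookup over the distinct keys present in the batch."""
    keys = {rec[key_field] for rec in batch_records}
    # With HBase, this would be one multi-row get per partition
    # rather than a per-record round trip.
    lookup = {k: store[k] for k in keys if k in store}
    # Merge the reference columns into each record; records whose
    # key is missing from the store pass through unchanged.
    return [{**rec, **lookup.get(rec[key_field], {})}
            for rec in batch_records]
```

Because only the keys in the current micro-batch are fetched, the cost scales with the batch size (e.g. 10k records), not the size of the reference table, and the store can be updated concurrently between batches.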