For spark structured streaming job one input is coming from a kafka topic while second input is a file (which will be refreshed every 5 mins by a python API). I need to join these 2 inputs and write to a kafka topic.
The issue I am facing is when second input file is being refreshed and spark streaming job is reading the file at the same time I get the error below:
File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.
Any help will be appreciated.
Use HBase as your store for static. It is more work for sure but allows for concurrent updating.
Where I work, all Spark Streaming uses HBase for lookup of data. Far faster. What if you have a 100M customers for a microbatch of 10k records? I know it was a lot of work initially.
See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
If you have a small static ref table, then static join is fine, but you also have updating, causing issues.