I have a Spark Streaming job which, when it starts, queries Hive and builds a Map[Int, String] that is then used in parts of the calculations the job performs.
The problem is that the data in Hive can change every 2 hours. I would like to be able to refresh this static data on a schedule, without having to restart the Spark job every time.
The initial load of the Map object takes around 1 minute.
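For illustration, a minimal sketch of what such a load might look like (the table and column names here are hypothetical, and spark is assumed to be a Hive-enabled SparkSession):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-with-hive-lookup")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical Hive table with an Int key column and a String value column.
val staticMap: Map[Int, String] = spark
  .sql("SELECT key, value FROM mydb.lookup")
  .collect()
  .map(row => row.getInt(0) -> row.getString(1))
  .toMap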
Any help is very welcome.
You can use a SparkListener, which is triggered every time a job starts for any stream within the Spark context. Since your database is only updated every two hours, there is no harm in reloading the data on every job start, AFAIK.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

sc.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Reload the map data here; the refreshed map will be
    // picked up by the closures sent to the executors.
  }
})
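If reloading on every job start is too expensive (you mention the initial load takes about a minute), you can throttle the refresh so Hive is only queried when the cached copy is older than two hours. A rough sketch, assuming a loadFromHive() helper that wraps the query shown above and that sc is your SparkContext:

import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Hypothetical helper; performs the Hive query and builds the map.
def loadFromHive(): Map[Int, String] =
  spark.sql("SELECT key, value FROM mydb.lookup")
    .collect()
    .map(row => row.getInt(0) -> row.getString(1))
    .toMap

val cached = new AtomicReference[Map[Int, String]](loadFromHive())
@volatile var lastRefresh = System.currentTimeMillis()
val refreshIntervalMs = 2 * 60 * 60 * 1000L  // two hours

sc.addSparkListener(new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val now = System.currentTimeMillis()
    if (now - lastRefresh > refreshIntervalMs) {
      cached.set(loadFromHive())  // refresh the stale lookup map
      lastRefresh = now
    }
  }
})

Note that each batch should read cached.get() on the driver (for example inside transform or foreachRDD) so the closure captures the fresh map when it is shipped to the executors.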