apache-spark, spark-streaming, spark-structured-streaming

How to query a persisted dataframe in spark job (A) from another spark job (B)


There are two Spark streaming jobs running in different containers - let's call them the teacher job and the student job. Each reads from its own Kafka topic. When a student message arrives in the student job, I need to 'query' the teacher job's persisted data to retrieve the teacher associated with that student (in this example a student has exactly one teacher, but a teacher can have many students). How can I persist a key-value pair (or a teacher DataFrame) in the teacher job and then look up that teacher in the student job, so I can process the student knowing its teacher? Can I use persist() in one job and unpersist() in another?


Solution

  • persist() and unpersist() operate within a single SparkContext, so a DataFrame cached by the teacher job is not visible to the student job: caching in one Spark application and reading that cache from another is not possible.
  • From the evidence, Spark Structured Streaming with Kafka integration using a stream-stream join is the way to go: consume both topics in a single streaming job and join the student stream to the teacher stream on the teacher id, rather than trying to share cached state across two jobs.
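A minimal sketch of that approach, assuming both topics carry JSON messages and that the broker address (`broker:9092`), topic names (`teacher-topic`, `student-topic`), and field names (`teacherId`, `studentId`, event-time columns) stand in for the real ones - adjust all of them to your setup. The watermarks and the time-range join condition let Spark bound how much teacher state it must keep:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object TeacherStudentJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("teacher-student-join")
      .getOrCreate()

    // Hypothetical message schemas -- replace with your real payload fields.
    val teacherSchema = new StructType()
      .add("teacherId", StringType)
      .add("teacherName", StringType)
      .add("teacherTime", TimestampType)
    val studentSchema = new StructType()
      .add("studentId", StringType)
      .add("studentTeacherId", StringType)
      .add("studentTime", TimestampType)

    // One job subscribes to both topics instead of two separate jobs.
    val teachers = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed address
      .option("subscribe", "teacher-topic")             // assumed topic
      .load()
      .select(from_json(col("value").cast("string"), teacherSchema).as("t"))
      .select("t.*")
      .withWatermark("teacherTime", "2 hours")

    val students = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "student-topic")
      .load()
      .select(from_json(col("value").cast("string"), studentSchema).as("s"))
      .select("s.*")
      .withWatermark("studentTime", "2 hours")

    // Stream-stream inner join: each student row is matched with the
    // teacher row whose id it references. The time-range predicate lets
    // Spark expire old teacher state instead of buffering it forever.
    val joined = students.join(
      teachers,
      expr("""
        studentTeacherId = teacherId AND
        studentTime BETWEEN teacherTime - interval 2 hours
                        AND teacherTime + interval 2 hours
      """)
    )

    joined.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/teacher-student-ckpt") // assumed path
      .start()
      .awaitTermination()
  }
}
```

Note that if teacher records are slowly changing reference data rather than a true stream, an alternative is to have the teacher job write them to an external store (files, Delta, a database) and do a stream-static join in the student job; the stream-stream join above fits when both sides genuinely arrive as events.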