I am reading data from two Kafka topics, which can be described as:
Topic1 data content: VehicleRegistrationNo, Timestamp, Location
Topic2 data content: VehicleRegistrationNo, Timestamp, Speed
I need to merge these two messages based on the nearest timestamp in both, and output a tuple as the message VehicleRegistrationNo, Timestamp, Speed, Location. I am reading these topics via two spouts, S1 and S2. Then a bolt, MergeS1andS2, takes input from these spouts and works as follows:
if (message from S1):
    save present message from S1 along with 2 previous messages (3 consecutive locations) to LocationHashMap
else if (message from S2):
    get location details from LocationHashMap and merge speed for the same vehicle with location info, then send details to next bolt as tuple
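For illustration, the merge step described above could be sketched in plain Java roughly as follows. This is only a sketch under stated assumptions: the class and method names (NearestLocationMerger, onLocation, onSpeed) are hypothetical, timestamps are assumed to be epoch milliseconds, the output is simplified to a comma-separated string, and an in-memory map stands in for whatever intermediate store is eventually used.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: keeps the last 3 locations per vehicle and merges speed readings. */
class NearestLocationMerger {
    private static final int MAX_LOCATIONS = 3;

    /** One location reading (timestamp assumed to be epoch milliseconds). */
    static final class LocationReading {
        final long timestamp;
        final String location;

        LocationReading(long timestamp, String location) {
            this.timestamp = timestamp;
            this.location = location;
        }
    }

    // Stand-in for the intermediate store, keyed by VehicleRegistrationNo.
    private final Map<String, Deque<LocationReading>> recentLocations = new HashMap<>();

    /** Called for a message from S1: remember it, dropping the oldest beyond 3. */
    void onLocation(String vehicleNo, long timestamp, String location) {
        Deque<LocationReading> readings =
                recentLocations.computeIfAbsent(vehicleNo, k -> new ArrayDeque<>());
        readings.addLast(new LocationReading(timestamp, location));
        if (readings.size() > MAX_LOCATIONS) {
            readings.removeFirst();
        }
    }

    /**
     * Called for a message from S2: returns "vehicleNo,timestamp,speed,location"
     * using the stored location nearest in time, or null if none is stored yet.
     */
    String onSpeed(String vehicleNo, long timestamp, double speed) {
        Deque<LocationReading> readings = recentLocations.get(vehicleNo);
        if (readings == null || readings.isEmpty()) {
            return null;
        }
        LocationReading nearest = null;
        for (LocationReading r : readings) {
            if (nearest == null
                    || Math.abs(r.timestamp - timestamp) < Math.abs(nearest.timestamp - timestamp)) {
                nearest = r;
            }
        }
        return vehicleNo + "," + timestamp + "," + speed + "," + nearest.location;
    }
}
```

The in-memory map is the part that would have to be swapped for a shared store in a multi-node topology; the nearest-timestamp comparison itself would stay the same.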
I know a HashMap is not an efficient way of storing data in a multi-node setup, so I read about Trident and Redis for storing intermediate data. What should I use to store my intermediate data in this scenario so that it works in a distributed topology?
Any NoSQL database will do the trick. Pick a key that uniquely identifies the tuple regardless of which topic it comes from. The logic would be something like: