java · apache-storm · trident

Storing intermediate data in a Storm topology


I am reading data from 2 Kafka topics, which can be described as:

Topic1 data content: VehicleRegistrationNo, Timestamp, Location
Topic2 data content: VehicleRegistrationNo, Timestamp, Speed

I need to merge these 2 messages based on the nearest timestamp in both, and output the result as a tuple: VehicleRegistrationNo, Timestamp, Speed, Location. I am reading these topics via 2 spouts, S1 and S2. A bolt MergeS1andS2 then takes input from these spouts and works as follows:

    if (message from S1):
        save the present message from S1 along with the 2 previous messages
        (3 consecutive locations) to LocationHashMap
    else if (message from S2):
        get the location details from LocationHashMap, merge the speed with the
        location info for the same vehicle, then send the details to the next
        bolt as a tuple
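The per-vehicle bookkeeping described above can be sketched outside of Storm as plain Java. This is only a minimal illustration of the merge logic; the class and method names (`MergeSketch`, `onLocation`, `onSpeed`) are made up for the example, and a `HashMap` keyed by `VehicleRegistrationNo` stands in for `LocationHashMap`:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the MergeS1andS2 bolt logic (not Storm API).
public class MergeSketch {
    // Last 3 (timestamp, location) pairs per vehicle, oldest first.
    static final Map<String, Deque<String[]>> locations = new HashMap<>();

    // Message from S1: remember up to 3 consecutive locations.
    static void onLocation(String vehicle, long ts, String location) {
        Deque<String[]> recent =
            locations.computeIfAbsent(vehicle, k -> new ArrayDeque<>());
        recent.addLast(new String[] { Long.toString(ts), location });
        if (recent.size() > 3) recent.removeFirst();
    }

    // Message from S2: pick the stored location whose timestamp is nearest
    // to the speed message's timestamp and emit the merged tuple.
    static String onSpeed(String vehicle, long ts, double speed) {
        Deque<String[]> recent = locations.get(vehicle);
        if (recent == null) return null; // no location seen yet for vehicle
        String[] best = null;
        long bestDiff = Long.MAX_VALUE;
        for (String[] entry : recent) {
            long diff = Math.abs(Long.parseLong(entry[0]) - ts);
            if (diff < bestDiff) { bestDiff = diff; best = entry; }
        }
        return vehicle + "," + ts + "," + speed + "," + best[1];
    }

    public static void main(String[] args) {
        onLocation("KA01", 100, "LocA");
        onLocation("KA01", 200, "LocB");
        // 190 is nearest to 200, so LocB is merged in.
        System.out.println(onSpeed("KA01", 190, 42.0));
    }
}
```

In a real bolt, `onLocation`/`onSpeed` would live inside `execute(Tuple)` and the result would be emitted via the collector; the point of the sketch is only the nearest-timestamp selection over the 3 retained locations.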

I know a HashMap is not an efficient way of storing data across multiple nodes, so I read about Trident and Redis for storing intermediate data. What should I use to store my intermediate data in this scenario so that it works in a distributed topology?


Solution

  • Any NoSQL database will do the trick. Pick a key that uniquely identifies the tuple regardless of which topic it comes from. The logic would be something like:

    • Try to look up the tuple in the database.
    • If the tuple does not exist in the database, store the tuple you got from the topic into the database.
    • If the tuple exists, merge the contents of the database tuple and the topic tuple, and store the resulting tuple back into the database (overwriting the previous tuple in the database).
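The lookup/merge/overwrite steps above can be sketched with a plain in-memory `Map` standing in for the NoSQL store (in production you would swap in calls to your Redis/Cassandra/etc. client). The key combines VehicleRegistrationNo and timestamp; the `upsert` name and record shape are assumptions for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the lookup-merge-store pattern, with a HashMap standing in
// for the NoSQL database. Not thread-safe; purely illustrative.
public class UpsertSketch {
    // key -> field map (e.g. {"location": "...", "speed": "..."})
    static final Map<String, Map<String, String>> store = new HashMap<>();

    static Map<String, String> upsert(String key, Map<String, String> fields) {
        Map<String, String> existing = store.get(key);
        if (existing == null) {
            // Tuple not in database yet: store the partial tuple as-is.
            store.put(key, new HashMap<>(fields));
        } else {
            // Tuple exists: merge fields and overwrite the stored tuple.
            existing.putAll(fields);
        }
        return store.get(key);
    }

    public static void main(String[] args) {
        // First half of the pair arrives from Topic1...
        upsert("KA01|100", Map.of("location", "LocA"));
        // ...then the matching half from Topic2 completes the record.
        Map<String, String> merged = upsert("KA01|100", Map.of("speed", "42.0"));
        System.out.println(merged); // now holds both location and speed
    }
}
```

With a real store such as Redis, the same pattern maps naturally onto a hash per key (read the hash, write the merged fields back), and the overwrite step makes the operation idempotent under replays.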