Tags: apache-storm, word-count, trident

Word Count using Storm or Trident


For the simple word count program in storm-starter, the logic is fairly straightforward:
1) split sentence into words
2) emit each word
3) aggregate the count (store the count in a map)
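
Concretely, those three steps boil down to two bolts, roughly along these lines. This is only a sketch modeled on storm-starter's word count; the class names and the org.apache.storm package are assumptions that vary with the Storm version:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Steps 1 and 2: split each sentence tuple and emit one tuple per word.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        for (String word : tuple.getString(0).split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

// Step 3: keep a running count per word in an in-memory map. This map is
// exactly the "state" the question worries about losing on a worker crash.
class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```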

However, there are a few problems and questions here:
1) The program uses 12 individual threads to execute the aggregation, so the count is not global. Do we have to add one more layer to get a global count?
2) The bolt stores the counts in a map, which means it holds state. Since Storm is stateless, if the current worker fails, are all counts stored in the bolt gone?
3) Should we use Trident to achieve this instead?


Solution

  • Each bolt holds roughly 1/12th of the words, and together the bolts make up the global state. The fields grouping sends a given word to the same bolt every time, so the counts are globally accurate; a wiring sketch follows the quoted documentation below.

    https://storm.apache.org/documentation/Concepts.html

    Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
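
    In code, that grouping is a single line in the topology wiring. The sketch below reuses the bolts from the earlier snippet; RandomSentenceSpout is storm-starter's test spout and the parallelism numbers simply mirror the example, so treat all of them as assumptions:

    ```java
    import org.apache.storm.starter.spout.RandomSentenceSpout;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountWiring {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // Any spout that emits a "sentence" field would do here.
            builder.setSpout("sentences", new RandomSentenceSpout(), 5);
            // Sentences can go to any splitter: shuffle grouping just balances load.
            builder.setBolt("split", new SplitSentenceBolt(), 8)
                   .shuffleGrouping("sentences");
            // Fields grouping on "word": the word's value picks the task, so every
            // occurrence of a given word lands on the same one of the 12 counters,
            // and each counter holds an exact count for its slice of the vocabulary.
            builder.setBolt("count", new WordCountBolt(), 12)
                   .fieldsGrouping("split", new Fields("word"));
        }
    }
    ```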

    Yes, the counts would be lost if the node crashed. Persistent storage should be used in accordance with your application's tolerance to inaccuracy and required performance characteristics.

    Trident helps you build state with exactly-once processing semantics (counting, in this example). If the backing map in the example were HBase, it would be resilient to bolt crashes, but you would either lose data when the bolt restarted (best-effort processing) or overcount words if the sentence tuple was replayed (at-least-once processing). If you need to count things exactly once, Trident is the way to go.
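
    For reference, the Trident version of the same count looks roughly like this. It is a sketch modeled on the Trident tutorial's word count; FixedBatchSpout and MemoryMapState are in-memory test helpers, and a durable StateFactory (for example one backed by HBase) would replace MemoryMapState in a real deployment. Package names assume Storm 1.x or later:

    ```java
    import org.apache.storm.trident.TridentState;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.BaseFunction;
    import org.apache.storm.trident.operation.TridentCollector;
    import org.apache.storm.trident.operation.builtin.Count;
    import org.apache.storm.trident.testing.FixedBatchSpout;
    import org.apache.storm.trident.testing.MemoryMapState;
    import org.apache.storm.trident.tuple.TridentTuple;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class TridentWordCountSketch {

        // Same sentence-splitting step as before, expressed as a Trident function.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split("\\s+")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static TridentState wordCounts() {
            // A looping test spout standing in for a real transactional source.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            return topology.newStream("sentences", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    // persistentAggregate keeps the running counts in a Trident
                    // state; swapping the in-memory factory for a durable one is
                    // what gives crash resilience with exactly-once counting.
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"))
                    .parallelismHint(6);
        }
    }
    ```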