Tags: algorithm, hadoop, grep, word-count

Hadoop performance analysis (Wordcount vs Grep)


I am working on Hadoop performance analysis and running some benchmarks on Hadoop. Surprisingly, Grep takes almost 1/10 of the time that WordCount takes to run, which seems very counterintuitive. Can anyone explain why this is the case?


Solution

  • A lot of the work in the map-reduce idiom is the communication between mappers and reducers.

    In the WordCount example, every word results in an output record (and a reducer input). In the Grep example, every matched pattern results in an output record. If the pattern doesn't match very often, that's not very many records.

    I would expect the mappers to run in roughly the same amount of time up to the point where they produce output, since both will be I/O bound reading the input. The CPU difference between the two tasks is negligible. However, a big difference in the amount of output will be highly noticeable, because every output record has to be moved from the mappers to the reducers.
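
To make the difference in output volume concrete, here is a rough sketch in plain Python. It does not use Hadoop at all: `wordcount_map`, `grep_map`, the sample `lines`, and the pattern `error` are all made up for illustration, and only mimic what the two mappers emit. Under those assumptions, it counts how many records each map phase would hand to the reducers for the same input.

```python
import re

def wordcount_map(line):
    """Mimic a WordCount mapper: one (word, 1) record per token."""
    return [(word, 1) for word in line.split()]

def grep_map(line, pattern):
    """Mimic a Grep mapper: one (match, 1) record per regex match."""
    return [(match, 1) for match in re.findall(pattern, line)]

# Hypothetical input; a real job would read splits from HDFS.
lines = [
    "the quick brown fox jumps over the lazy dog",
    "hadoop moves map output to reducers during the shuffle",
    "an error appeared in the log",
]

# Both "mappers" read exactly the same input lines...
wordcount_records = [rec for line in lines for rec in wordcount_map(line)]
grep_records = [rec for line in lines for rec in grep_map(line, r"error")]

# ...but they emit very different numbers of records.
print("WordCount map output records:", len(wordcount_records))
print("Grep map output records:", len(grep_records))
```

Both functions scan the same three lines, but the WordCount sketch emits a record for every token while the Grep sketch emits a record only where the pattern matches. On a real cluster, comparing the map output record counters of the two jobs should show the same kind of gap.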