I am working on Hadoop performance analysis and running some benchmarks on Hadoop. Surprisingly, Grep takes almost 1/10 of the time that WordCount takes to run, which is very counterintuitive. Can anyone explain why this is?
A lot of the work in the map-reduce idiom is the communication between mappers and reducers.
In the WordCount example, every word results in an output record (and a reducer input). In the Grep example, every matched pattern results in an output record. If the pattern doesn't match very often, that's not very many records.
I would expect the mappers to run in roughly the same amount of time up to the point where they produce output, since both will be I/O bound. The CPU difference between the two tasks is negligible. However, a big difference in the amount of output will be highly noticeable, because every map output record has to be sorted, shuffled to a reducer, and merged there.
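To make the difference in record counts concrete, here is a minimal sketch in plain Python (not the actual Hadoop example code). The functions `wordcount_map`, `grep_map`, and `count_map_output`, the sample lines, and the pattern `error` are all made up for illustration. It assumes a WordCount-style mapper emits one record per whitespace-separated token and a Grep-style mapper emits one record per regex match.

```python
import re

def wordcount_map(line):
    # Hypothetical WordCount-style mapper: one (word, 1) record per token.
    return [(word, 1) for word in line.split()]

def grep_map(line, pattern):
    # Hypothetical Grep-style mapper: one (match, 1) record per regex match.
    return [(m.group(0), 1) for m in pattern.finditer(line)]

def count_map_output(lines, pattern):
    # Total number of map output records each job would hand to the shuffle.
    wc_records = sum(len(wordcount_map(line)) for line in lines)
    grep_records = sum(len(grep_map(line, pattern)) for line in lines)
    return wc_records, grep_records

# Made-up input; a real benchmark would read far more data.
lines = [
    "the quick brown fox jumps over the lazy dog",
    "hadoop runs the map phase before the reduce phase",
    "an error occurred while reading the input split",
]

wc_records, grep_records = count_map_output(lines, re.compile(r"error"))
print("WordCount map output records:", wc_records)
print("Grep map output records:", grep_records)
```

Both mappers read exactly the same input, but the WordCount-style one emits a record for every token while the Grep-style one emits a record only where the pattern matches. On a real cluster you could check the same thing by comparing the map output record counters of the two jobs.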