hadoop mapreduce hadoop-streaming elastic-map-reduce

Log file analysis in Hadoop/MapReduce

Hi, I have some query log files in the following form:

    q_string    q_visits    q_date
0   red ballons 1790        2012-10-02 00:00:00
1   blue socks  364         2012-10-02 00:00:00
2   current     280         2012-10-02 00:00:00
3   molecular   259         2012-10-02 00:00:00
4   red table   201         2012-10-02 00:00:00

I have one file per day, covering every month of a year. What I would like to do is:

(1) Group the files by month (or more specifically group all of the q_strings belonging to each month)

(2) Since the same q_string may appear on multiple days, group identical q_strings within the month, summing q_visits across all instances of that q_string

(3) Normalise the q_visits against the grouped q_string (by dividing the sum of q_visits for the grouped q_string by the sum of q_visits across all q_strings within the month)

I expect the output to have a similar schema to the input, but with an extra column containing the normalised monthly q_visits volume.

I have been doing this in Python/Pandas, but now have more data and feel that the problem lends itself more easily to MapReduce.

Would the above be easy to implement in EMR/AWS? Conceptually, what would be the MR workflow for doing the above? I would like to keep coding in Python so will probably use streaming.

Thanks in advance for any help.

Solution

I would rather use Pig. It is easy to learn and write, and it avoids lengthy pieces of code. You just express your data processing as a series of transformations (a data flow) and get the desired result. If it fits your needs, it is much better than writing raw MR jobs. Pig was developed for exactly this kind of task, and it will definitely save you a lot of time.
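
If you do want to stay with Python and streaming as mentioned in the question, the same data flow can be sketched as two map/reduce passes: the first sums q_visits per (month, q_string), the second divides each sum by the monthly total. This is only a rough sketch under assumptions: the function names are made up for illustration, the input is assumed to be tab-separated with four fields (index, q_string, q_visits, q_date), and `sorted()` stands in for the sort Hadoop performs between map and reduce.

```python
from itertools import groupby
from operator import itemgetter


def map_month_query(lines):
    """Pass 1 mapper: emit ((month, q_string), visits) for each log row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # assumed tab-separated layout
        if len(fields) != 4 or not fields[2].isdigit():
            continue  # skip the header row and malformed lines
        _, q_string, q_visits, q_date = fields
        yield (q_date[:7], q_string), int(q_visits)  # "2012-10-02 ..." -> "2012-10"


def reduce_sum(pairs):
    """Pass 1 reducer: sum visits per (month, q_string); input must be sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(visits for _, visits in group)


def reduce_normalise(summed):
    """Pass 2 reducer: group by month and divide each sum by the monthly total."""
    for month, group in groupby(summed, key=lambda kv: kv[0][0]):
        rows = list(group)  # buffers one month's distinct q_strings in memory
        total = sum(visits for _, visits in rows)
        for (_, q_string), visits in rows:
            share = visits / total if total else 0.0
            yield month, q_string, visits, share


def run_local(lines):
    """Simulate both passes locally; sorted() plays the role of the shuffle."""
    summed = reduce_sum(sorted(map_month_query(lines)))
    return list(reduce_normalise(summed))
```

In a real streaming job each function would read from `sys.stdin` and print tab-separated lines instead of yielding tuples, and the second pass would need its rows partitioned by month only, so that a single reducer sees every q_string for that month.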