Hi, I have some query log files of the following form:
      q_string  q_visits               q_date
0  red ballons      1790  2012-10-02 00:00:00
1   blue socks       364  2012-10-02 00:00:00
2      current       280  2012-10-02 00:00:00
3    molecular       259  2012-10-02 00:00:00
4    red table       201  2012-10-02 00:00:00
I have one file per day, covering every month of a year. What I would like to do is:
(1) Group the files by month (or more specifically group all of the q_strings belonging to each month)
(2) Since the same q_string may appear on multiple days, I would like to group the same q_strings within the month, summing on q_visits across all the instances of that q_string
(3) Normalise the summed q_visits of each grouped q_string by dividing it by the sum of q_visits across all q_strings within the month
I expect the output to have a similar schema to the input, plus an extra column holding the normalised monthly q_visits volume.
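To make steps (1) to (3) concrete, here is a minimal in-memory sketch of the calculation in plain Python. It assumes the rows are whitespace-separated exactly as in the sample above (index, q_string that may contain spaces, q_visits, date, time). The helper names `parse_line` and `monthly_normalised` and the third sample row are made up for illustration.

```python
from collections import defaultdict


def parse_line(line):
    """Parse one row such as '0 red ballons 1790 2012-10-02 00:00:00'.

    Assumed layout: index, q_string words..., q_visits, date, time.
    """
    parts = line.split()
    q_string = " ".join(parts[1:-3])
    q_visits = int(parts[-3])
    month = parts[-2][:7]  # '2012-10-02' -> '2012-10'
    return month, q_string, q_visits


def monthly_normalised(lines):
    """Return {(month, q_string): (summed_visits, share_of_month)}."""
    per_query = defaultdict(int)  # steps (1) and (2): sum per month and q_string
    per_month = defaultdict(int)  # denominator for step (3)
    for line in lines:
        month, q_string, q_visits = parse_line(line)
        per_query[(month, q_string)] += q_visits
        per_month[month] += q_visits
    return {
        key: (visits, visits / per_month[key[0]])
        for key, visits in per_query.items()
    }


rows = [
    "0 red ballons 1790 2012-10-02 00:00:00",
    "1 blue socks 364 2012-10-02 00:00:00",
    "0 red ballons 210 2012-10-03 00:00:00",  # hypothetical second day
]
result = monthly_normalised(rows)
```

With these three rows, "red ballons" sums to 2000 visits out of a monthly total of 2364, so its normalised value is 2000 / 2364.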
I have been doing this in Python/Pandas, but I now have more data and feel that the problem lends itself naturally to MapReduce.
Would the above be easy to implement on AWS EMR? Conceptually, what would the MR workflow for it look like? I would like to keep coding in Python, so I will probably use streaming.
Thanks in advance for any help.
I would use Pig instead. It is easy to learn and write, and it avoids lengthy code: you express your data processing as a series of transformations, or a data flow, and get the desired result. If it fits your needs, it is far more convenient than raw MR jobs. Pig was developed for exactly this kind of task, and it will likely save you a lot of time.
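For comparison, since the question mentions staying in Python with streaming, the same data flow (group, sum, divide by the monthly total) can be sketched as map and reduce steps. This is only an illustration under assumptions: the functions work on in-memory iterables, `sorted()` stands in for the sort/shuffle that Hadoop performs between mapper and reducer, the `month|q_string` key format and all function names are hypothetical, and the input layout is assumed to be the whitespace-separated one from the question.

```python
from itertools import groupby


def map_visits(lines):
    """Mapper: emit 'month|q_string<TAB>q_visits' for each input row."""
    for line in lines:
        parts = line.split()
        # Assumed layout: index, q_string words..., q_visits, date, time.
        q_string = " ".join(parts[1:-3])
        month = parts[-2][:7]  # '2012-10-02' -> '2012-10'
        yield f"{month}|{q_string}\t{parts[-3]}"


def reduce_sum(sorted_pairs):
    """Reducer: sum the values of each run of identical keys."""
    split = (pair.split("\t") for pair in sorted_pairs)
    for key, group in groupby(split, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


def normalise(summed):
    """Second pass: divide each sum by the total of its month."""
    summed = list(summed)
    totals = {}
    for key, visits in summed:
        month = key.split("|", 1)[0]
        totals[month] = totals.get(month, 0) + visits
    for key, visits in summed:
        month, q_string = key.split("|", 1)
        yield month, q_string, visits, visits / totals[month]


rows = [
    "0 red ballons 1790 2012-10-02 00:00:00",
    "1 blue socks 364 2012-10-02 00:00:00",
    "0 red ballons 210 2012-10-03 00:00:00",  # hypothetical second day
]
shuffled = sorted(map_visits(rows))  # stands in for Hadoop's sort/shuffle
output = list(normalise(reduce_sum(shuffled)))
```

In a real streaming job the mapper and reducer would read from stdin and write tab-separated lines to stdout, and the normalisation would typically be a second job (or a small post-processing step) because it needs each month's total before any row can be divided.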