Search code examples
apache-pigamazon-emr

pig script loop though calculate averages


I have data that will be run in pig using aws emr looks like. The columns are called model, year, units_sold, total_customers.

chevy     1900     1000    49
chevy     1901     73      92
chevy     1902     45      65
chevy     1903     300     75
ford      1900     35      12
ford      1901     777     32
ford      1902     932     484
ford      1903     33      15

What I am trying to do is calculate the average for every car type. the averages will be calculated by adding the sum of units_sold, divided by the sum of total_customers. so the desired result would look like

chevy    (1000+73+45+300) / (49+92+65+75) = 5.04
ford     (35+777+932+33) / (12+32+484+15) = 3.27

in my script i have

A = *Step to load data*;
B = GROUP A by year;
C = results = FOREACH B GENERATE SUM(units_sold)/SUM(total_customers);
dump C;

This returns an incorrect result.How can I achieve results that look like

chevy    5.04
ford     3.27

Solution

  • Looks like you need to group by car type, not year. Also, you might need to cast to float if units_sold and total_customers are integers if you don't want a rounded result. Try:

    B = GROUP A by model;
    C = FOREACH B GENERATE (float)SUM(units_sold)/(float)SUM(total_customers);