I have data that will be run in pig using aws emr looks like. The columns are called model, year, units_sold, total_customers.
chevy 1900 1000 49
chevy 1901 73 92
chevy 1902 45 65
chevy 1903 300 75
ford 1900 35 12
ford 1901 777 32
ford 1902 932 484
ford 1903 33 15
What I am trying to do is calculate the average for every car type. the averages will be calculated by adding the sum of units_sold, divided by the sum of total_customers. so the desired result would look like
chevy (1000+73+45+300) / (49+92+65+75) = 5.04
ford (35+777+932+33) / (12+32+484+15) = 3.27
in my script i have
A = *Step to load data*;
B = GROUP A by year;
C = results = FOREACH B GENERATE SUM(units_sold)/SUM(total_customers);
dump C;
This returns an incorrect result.How can I achieve results that look like
chevy 5.04
ford 3.27
Looks like you need to group by car type, not year. Also, you might need to cast to float if units_sold
and total_customers
are integers if you don't want a rounded result. Try:
B = GROUP A by model;
C = FOREACH B GENERATE (float)SUM(units_sold)/(float)SUM(total_customers);