I am new to scripting using Pig Latin. I am stuck to write a pig script which will find the average of a column value and also to find the average of the subtracted values between two columns.
I am reading the data from a csv file having starttime and endtime columns as below:
"starttime","endtime",
"23","46",
"32","49",
"54","59"
The code that I have tried so far is as below :
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
start_ts = FOREACH file GENERATE st;
grouped = group start_ts by st
ILLUSTRATE grouped
The ILLUSTRATE output I am getting is as below and I am not able to apply the AVG function.
------------------------------------------
-------------------------------------------------------------------------------------
| grouped | group:int | file:bag{:tuple(st:int,et:int)} |
-------------------------------------------------------------------------------------
| | | {(, ), (, )} |
-------------------------------------------------------------------------------------
Can anybody please help me getting the average of the starttime which would be the result of (23 + 32 + 54)/3
And also some ideas on how to code the (endtime -starttime)/no. of records (i.e 3 in this case) would be of great help for me to get started.
Thanks.
First ensure that you are loading the data correctly.Looks like you have double quotes i.e " around your data.Load the data as chararray,replace the double quotes and then cast it to int,finally apply the AVG function for the starttime.For the avg of endtime - starttime just subtract the 2 fields and apply AVG.
A = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
B = FOREACH A GENERATE (int)REPLACE(st,'\\"','') as st,(int)REPLACE(et,'\\"','') as et;
C = GROUP B ALL;
D = FOREACH C GENERATE AVG(B.st),AVG(B.et - B.st);