Search code examples
hadoopapache-pigbigdata

How to find average of a column and average of subtraction of two columns in Pig?


I am new to scripting using Pig Latin. I am stuck to write a pig script which will find the average of a column value and also to find the average of the subtracted values between two columns.

I am reading the data from a csv file having starttime and endtime columns as below:

"starttime","endtime",
"23","46",
"32","49",
"54","59"

The code that I have tried so far is as below :

file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
start_ts = FOREACH file GENERATE st;
grouped = group start_ts by st
ILLUSTRATE grouped

The ILLUSTRATE output I am getting is as below and I am not able to apply the AVG function.

------------------------------------------
-------------------------------------------------------------------------------------
| grouped     | group:int     | file:bag{:tuple(st:int,et:int)}                 | 
-------------------------------------------------------------------------------------
|             |               | {(, ), (, )}                                        | 
-------------------------------------------------------------------------------------

Can anybody please help me getting the average of the starttime which would be the result of (23 + 32 + 54)/3

And also some ideas on how to code the (endtime -starttime)/no. of records (i.e 3 in this case) would be of great help for me to get started.

Thanks.


Solution

  • First ensure that you are loading the data correctly.Looks like you have double quotes i.e " around your data.Load the data as chararray,replace the double quotes and then cast it to int,finally apply the AVG function for the starttime.For the avg of endtime - starttime just subtract the 2 fields and apply AVG.

    A = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
    B = FOREACH A GENERATE (int)REPLACE(st,'\\"','') as st,(int)REPLACE(et,'\\"','') as et;
    C = GROUP B ALL;
    D = FOREACH C GENERATE AVG(B.st),AVG(B.et - B.st);