Search code examples
hadoopapache-pig

Hadoop Pig Ordered Analytical Functions


I am new in Pig and would like to use an ordered analytical function, similar to what is possible in SQL.

My data looks something like this:

(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)

What I want to do is find the change in opening stock price from day to day. So, my output would look something like this:

(stock_symbol,date,stock_price_open,stock_price_close,stock_price_change)
(TAC,2001-08-06,16.39,16.36,NULL)
(TAC,2001-08-07,16.3,16.54,-0.09)
(TAC,2001-08-08,16.55,16.44,0.25)
(TAC,2001-08-09,16.45,16.48,-0.1)
(TAC,2001-08-10,16.5,15.8,0.05)

I want Pig to be able to look at a row ahead or behind the current row. Is this possible, or does Pig not allow for this type of analysis?


Solution

  • You can use the below script to get the output as expected, but might be some fine tuning is required.

    A = load '/tmp/pig/test/test' using PigStorage (',');
    B= foreach A generate $0 as stock_symbol, ToDate($1,'yyyy-mm-dd') as dt,(double)$2 as stock_price_open, (double)$3 as stock_price_close,'PT24H' as dthour;
    C= foreach B generate $0 as stock_symbol, $1 as dt_curr, SubtractDuration($1,$4) as dt_old, $2 as stock_price_open, $3 as stock_price_close;
    START = FILTER C BY ($1 == $1);
    D = JOIN C by $0 , START by $0;
    Filter_D = FILTER D by ((DaysBetween($1,$6)==1) and (DaysBetween($2,$7)==1));
    E = foreach Filter_D generate $0 as stock_symbol, $1 as dt, $3 as stock_price_open, $4 as stock_price_close, $3-$8 as stock_price_change;
    

    The Output as :

    (TAC,2001-01-07T00:08:00.000-08:00,16.3,16.54,-0.08999999999999986)
    (TAC,2001-01-08T00:08:00.000-08:00,16.55,16.44,0.25)
    (TAC,2001-01-09T00:08:00.000-08:00,16.45,16.48,-0.10000000000000142)
    (TAC,2001-01-10T00:08:00.000-08:00,16.5,15.8,0.05000000000000071)
    

    As we required to calculate the One Day Older Opening Date so have taking variable "PT24H" which defined 24 Hours in Pig. The Same was printed in next action by using ToDate() & SubtractDuration(), followed by a Join and DaysBetween() action to get the difference.

    ToDate(),SubtractDuration(),DaysBetween() are inbilt Function in PIG UDF, u can write suitable UDF for fine tuning the same script, with more proper action.