I am new in Pig and would like to use an ordered analytical function, similar to what is possible in SQL.
My data looks something like this:
(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)
What I want to do is find the change in opening stock price from day to day. So, my output would look something like this:
(stock_symbol,date,stock_price_open,stock_price_close,stock_price_change)
(TAC,2001-08-06,16.39,16.36,NULL)
(TAC,2001-08-07,16.3,16.54,-0.09)
(TAC,2001-08-08,16.55,16.44,0.25)
(TAC,2001-08-09,16.45,16.48,-0.1)
(TAC,2001-08-10,16.5,15.8,0.05)
I want Pig to be able to look at a row ahead or behind the current row. Is this possible, or does Pig not allow for this type of analysis?
You can use the below script to get the output as expected, but might be some fine tuning is required.
A = load '/tmp/pig/test/test' using PigStorage (',');
B= foreach A generate $0 as stock_symbol, ToDate($1,'yyyy-mm-dd') as dt,(double)$2 as stock_price_open, (double)$3 as stock_price_close,'PT24H' as dthour;
C= foreach B generate $0 as stock_symbol, $1 as dt_curr, SubtractDuration($1,$4) as dt_old, $2 as stock_price_open, $3 as stock_price_close;
START = FILTER C BY ($1 == $1);
D = JOIN C by $0 , START by $0;
Filter_D = FILTER D by ((DaysBetween($1,$6)==1) and (DaysBetween($2,$7)==1));
E = foreach Filter_D generate $0 as stock_symbol, $1 as dt, $3 as stock_price_open, $4 as stock_price_close, $3-$8 as stock_price_change;
The Output as :
(TAC,2001-01-07T00:08:00.000-08:00,16.3,16.54,-0.08999999999999986)
(TAC,2001-01-08T00:08:00.000-08:00,16.55,16.44,0.25)
(TAC,2001-01-09T00:08:00.000-08:00,16.45,16.48,-0.10000000000000142)
(TAC,2001-01-10T00:08:00.000-08:00,16.5,15.8,0.05000000000000071)
As we required to calculate the One Day Older Opening Date so have taking variable "PT24H" which defined 24 Hours in Pig. The Same was printed in next action by using ToDate() & SubtractDuration(), followed by a Join and DaysBetween() action to get the difference.
ToDate(),SubtractDuration(),DaysBetween() are inbilt Function in PIG UDF, u can write suitable UDF for fine tuning the same script, with more proper action.