apache-pig

How to calculate a time series in Pig


If I write DUMP monthly, I get:

(Jan,2)
(Feb,102)
(Mar,250)
(Apr,450)
(May,590)
(Jun,790)
(Jul,1040)
(Aug,1260)
(Sep,1440)
(Oct,1770)
(Nov,2000)
(Dec,2500)

Checking schema:

DESCRIBE monthly;

Output:

monthly: {group: chararray,total_case: long}

I need to calculate the increase rate for each month relative to the previous month. So, for February, it will be:

(total_case in Feb - total_case in Jan) / total_case in Jan = (102 - 2) / 2 = 50

For March it will be: (250 - 102) / 102 = 1.45098039

So, if I store the results in monthlyIncrease, DUMP monthlyIncrease should give:

(Jan,0)
(Feb,50)
(Mar,1.45098039)
........
........
(Dec, 0.25)

Is this possible in Pig? I can't think of any way to do it.


Solution

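  • Yes, it is possible. Create a second copy of the relation, say b. Sort both relations by month, rank both a and b, join on a.rank = b.rank + 1, and then do the calculation. Jan has no previous month, so it drops out of the join; you will have to UNION the (Jan,0) record back in.

    The script below assumes monthly is already sorted by month in calendar order (Jan, Feb, ...), not alphabetically: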

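    monthly = LOAD '/test.txt' USING PigStorage('\t') AS (a1:chararray, a2:int);
    -- RANK prepends a row number; two aliases are needed for the self-join
    a = RANK monthly;
    b = RANK monthly;
    -- Pair each month (a) with the month before it (b)
    c = JOIN a BY $0, b BY ($0 + 1);
    d = FOREACH c GENERATE a::a1, (double)(a::a2 - b::a2) / (double)b::a2;
    -- The first month has no predecessor, so add it back with a rate of 0.0
    e = FILTER a BY $0 == 1;
    f = FOREACH e GENERATE a1, 0.0;
    g = UNION d, f;
    DUMP g;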
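
The approach above depends on the input already being in calendar order. As a rough sketch of one way to drop that assumption, the variant below joins against a small lookup file that maps each month name to its position, then uses a LEFT OUTER join so the first month is kept without a separate UNION. The file '/months.txt' and the aliases months, cur, prev, pairs and rates are hypothetical names introduced for illustration; they are not part of the original answer.

```pig
-- Hypothetical lookup file, one line per month: "Jan\t1", "Feb\t2", ...
months  = LOAD '/months.txt' USING PigStorage('\t') AS (name:chararray, num:int);
monthly = LOAD '/test.txt' USING PigStorage('\t') AS (a1:chararray, a2:int);

-- Attach the calendar position to every monthly total
j = JOIN monthly BY a1, months BY name;

-- cur keeps each month as-is; prev shifts the position forward by one
-- so that it lines up with the month that follows it
cur  = FOREACH j GENERATE months::num AS num, monthly::a1 AS a1, monthly::a2 AS a2;
prev = FOREACH j GENERATE months::num + 1 AS next_num, monthly::a2 AS a2;

-- LEFT OUTER keeps the first month, which has no previous month (nulls on the right)
pairs = JOIN cur BY num LEFT OUTER, prev BY next_num;

-- Rate is 0.0 when there is no previous month, otherwise (cur - prev) / prev
rates = FOREACH pairs GENERATE cur::num AS num, cur::a1 AS month,
        (prev::a2 IS NULL ? 0.0 : (double)(cur::a2 - prev::a2) / (double)prev::a2) AS rate;

sorted = ORDER rates BY num;
DUMP sorted;
```

Each output tuple carries the month number as an extra leading field, which is only there to make the final ORDER possible. Note that a previous-month total of 0 would make the division return null in Pig, so that case may need its own handling.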
    
