Tags: apache-pig, bigdata

Time differences in Apache Pig?


In a Big Data context, I have a time series S1 = (t1, t2, t3, ...) sorted in ascending order. I would like to produce the series of time differences S2 = (t2-t1, t3-t2, ...).

  1. Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one.

  2. If not, what would be a good way to do this that is suitable for large amounts of data?


Solution

    1. S1 = rank the full series t1...tn, giving (Id, Timestamp)
    2. S2 = drop t1 and re-rank the rest t2...tn, giving (Id, Timestamp)
    3. S3 = join S1 by Id with S2 by Id
    4. S4 = from S3, generate S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)
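To make the id alignment in these steps concrete, here is a minimal plain-Python sketch of the same rank, shift, and join logic. It is only an illustration on in-memory data, not a Pig implementation and not something that scales; the function name `time_differences` and the dictionary-based "join" are assumptions made for this example.

```python
from datetime import datetime

def time_differences(timestamps):
    """Pair each timestamp with its successor via a join on rank, then subtract."""
    # S1: rank the full series t1..tn as id -> timestamp (ids start at 1).
    s1 = {i: t for i, t in enumerate(timestamps, start=1)}
    # S2: drop t1 and re-rank t2..tn, so id k in S2 holds t(k+1).
    s2 = {i: t for i, t in enumerate(timestamps[1:], start=1)}
    # S3: inner join on id; the last id of S1 has no partner and drops out.
    s3 = [(s1[i], s2[i]) for i in sorted(s1.keys() & s2.keys())]
    # S4: each timestamp minus its predecessor, as timedelta objects.
    return [later - earlier for earlier, later in s3]

# First three timestamps of the sample data used in this answer.
sample = [datetime.fromisoformat(s) for s in (
    "2014-02-19T01:03:37",
    "2014-02-26T01:03:39",
    "2014-02-28T01:03:45",
)]
days = [d.days for d in time_differences(sample)]
print(days)  # prints [7, 2]
```

Because S2 is re-ranked after dropping t1, id k in S1 lines up with t(k+1) in S2, so a plain equi-join on the id is enough and no inequality or cross join is needed.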

    Edit

    Sample Data

    2014-02-19T01:03:37
    2014-02-26T01:03:39
    2014-02-28T01:03:45
    2014-04-01T01:04:22
    2014-05-11T01:06:02
    2014-06-30T01:08:56
    

    Script

    -- Load the timestamps and number the rows 1..n (the input is assumed to be sorted already).
    s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
    s11 = FOREACH s1 GENERATE ToDate(t) AS t1;
    s1_new = RANK s11;
    
    -- Load the same file again to build the shifted copy.
    s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
    s22 = FOREACH s2 GENERATE ToDate(t) AS t1;
    s2_new = RANK s22;
    
    -- Drop the first-ranked row, then re-rank so that rank k now holds t(k+1).
    ss = FILTER s2_new BY (rank_s22 > 1);
    ss_new = RANK ss;
    
    -- Join each timestamp with its successor and take the difference in whole days.
    s3 = JOIN s1_new BY rank_s11, ss_new BY rank_ss;
    s4 = FOREACH s3 GENERATE DaysBetween(ss_new::t1, s1_new::t1) AS time_diff;
    
    DUMP s4;
    

    Difference in Days