apache-pig

Find continuity of elements in Pig


How can I find the length of each continuous run of a field's value, along with the run's starting position? The input looks like this:

A-1
B-2
B-3
B-4
C-5
C-6

The output I want is (value, run length, starting position):

A,1,1
B,3,2
C,2,5

Thanks.


Solution

  • Assuming each value occupies a single continuous run of indices (that is, no value reappears after a gap), you can get the desired result by grouping on value and using COUNT and MIN to compute continuous_counts and start_index, respectively.

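    -- Split each line on '-' into the value and its position.
    A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
    
    -- One output row per value: how many rows it has and its smallest index.
    B = FOREACH (GROUP A BY value) GENERATE
        group AS value,
        COUNT(A) AS continuous_counts,
        MIN(A.index) AS start_index;
    
    STORE B INTO 'output' USING PigStorage(',');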
    

    If a value can appear in more than one run (discontinuous data), the solution is no longer trivial in native Pig, and you might need to write a UDF for that purpose.
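
    As a rough illustration of the logic such a UDF might implement, here is a minimal Python sketch. It is not Pig code and is not wired into Pig; the function names `find_runs` and `parse_line` are hypothetical, and it assumes the rows fit in memory and that indices are unique integers. It sorts rows by index and starts a new run whenever the value changes or the index is not exactly one more than the previous index.

```python
from typing import Iterable, List, Optional, Tuple


def parse_line(line: str) -> Tuple[str, int]:
    """Parse one input line such as 'B-3' into ('B', 3)."""
    value, index = line.strip().split("-")
    return value, int(index)


def find_runs(rows: Iterable[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (value, run_length, start_index) for every continuous run.

    A run continues only while the value stays the same and the index
    increases by exactly 1; anything else closes the run and opens a new one.
    """
    runs: List[Tuple[str, int, int]] = []
    prev_value: Optional[str] = None
    prev_index: Optional[int] = None
    start = 0
    length = 0

    for value, index in sorted(rows, key=lambda row: row[1]):
        if length and value == prev_value and index == prev_index + 1:
            length += 1  # same value, adjacent index: extend the current run
        else:
            if length:
                runs.append((prev_value, length, start))  # close the previous run
            start, length = index, 1  # open a new run at this row
        prev_value, prev_index = value, index

    if length:
        runs.append((prev_value, length, start))  # close the final run
    return runs


if __name__ == "__main__":
    sample = ["A-1", "B-2", "B-3", "B-4", "C-5", "C-6"]
    for value, count, start in find_runs(parse_line(s) for s in sample):
        print(f"{value},{count},{start}")
```

    On the sample input from the question this yields the rows A,1,1, B,3,2 and C,2,5. Unlike the GROUP-based script, a value that reappears later (for example A-1, B-2, A-3) produces a separate row for each run.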