Search code examples
apache-pig

Pig Latin Count difference between two tables


I have one table loaded twice to perform a self join called current and previous. Both contain columns "key" (not unique) and "value". I have grouped by key, and counted the number of values in each group of keys.

I would like to find how many more values were added to the current table compared to the previous table, but I get the error "Invalid scalar projection: cur_count : A column needs to be projected from a relation for it to be used as a scalar". I am relatively new to pig latin, so I'm unsure of what the syntax should be for performing this difference.

Please disregard syntax for the cur_count and prev_count.

cur_count = FOREACH cur_grouped GENERATE COUNT(current);
prev_count = FOREACH prev_grouped GENERATE COUNT(previous);

left_join = join current by key LEFT OUTER, previous by key-1;
difference = FOREACH left_join GENERATE key, cur_count-prev_count; //error here
dump difference;

Below are some sample data

key     value
1         12
1         34
1         11
1         45
2          4
3         34
3         34
3         23
4         15
4         19

What my script does so far: it counts the number of values in each group of keys

key     count
1         4
2         1
3         3
4         2

I would like to find the difference in number of values between a key and the previous key

key     difference 
 2          -3
 3           2
 4          -1  

Solution

  • cur_count and prev_count are relations and cannot be used the way you are using.You can achieve the desired output using the script below.After joining the relations with (key-1),use the columns from the relation to get the difference.

    A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int);
    B = GROUP A BY f1;
    C = FOREACH B GENERATE group,COUNT(A);
    D = FOREACH B GENERATE group,COUNT(A);
    E = JOIN C BY $0,D BY ($0-1);
    F = FOREACH E GENERATE $2,$3-$1;
    DUMP F;
    

    Output