I have one table loaded twice to perform a self join called current and previous. Both contain columns "key" (not unique) and "value". I have grouped by key, and counted the number of values in each group of keys.
I would like to find how many more values were added to the current table compared to the previous table, but I get the error "Invalid scalar projection: cur_count : A column needs to be projected from a relation for it to be used as a scalar". I am relatively new to pig latin, so I'm unsure of what the syntax should be for performing this difference.
Please disregard syntax for the cur_count and prev_count.
cur_count = FOREACH cur_grouped GENERATE COUNT(current);
prev_count = FOREACH prev_grouped GENERATE COUNT(previous);
left_join = join current by key LEFT OUTER, previous by key-1;
difference = FOREACH left_join GENERATE key, cur_count-prev_count; //error here
dump difference;
Below are some sample data
key value
1 12
1 34
1 11
1 45
2 4
3 34
3 34
3 23
4 15
4 19
What my script does so far: it counts the number of values in each group of keys
key count
1 4
2 1
3 3
4 2
I would like to find the difference in number of values between a key and the previous key
key difference
2 -3
3 2
4 -1
cur_count and prev_count are relations and cannot be used the way you are using.You can achieve the desired output using the script below.After joining the relations with (key-1),use the columns from the relation to get the difference.
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE group,COUNT(A);
D = FOREACH B GENERATE group,COUNT(A);
E = JOIN C BY $0,D BY ($0-1);
F = FOREACH E GENERATE $2,$3-$1;
DUMP F;