I have the below test data.
A B C
M O
M M M
M M M
N O
P N
I would like to get the total count of entries in this sample test data i.e 12
I have the below code to do the same, but i am getting an incorrect result.
Any help on how to rectify would be helpful.
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
values = FOREACH test GENERATE A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);
This is giving answer as 15, rather than 12 .
I would also like to get the count of each of these values,like M=7, N=2, O=2, P=1. I have written the below code.
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
values = FOREACH test GENERATE A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
A = FOREACH grp {
B =FILTER test.A=='M' OR test.B=='M' OR test.C=='M';
GENERATE group, COUNT(B);
};
I am getting an error "Scalar has more than one row in the output".
You are counting the column names as well in your final count.Modify the script to ignore the first row and then group by and count.
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
ranked = rank test;
test1 = Filter ranked by ($0 > 1); --Note:rank_test should work.
values = FOREACH test1 GENERATE A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);