Search code examples
apache-pig

Count of values across fields in Pig


I have the below test data.

A   B   C

M   O

M   M   M

M   M   M

N       O

P       N

I would like to get the total count of entries in this sample test data i.e 12

I have the below code to do the same, but i am getting an incorrect result.

Any help on how to rectify would be helpful.

test=  LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray); 
values = FOREACH test GENERATE A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;  
grp = GROUP values ALL;  
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C); 

This is giving answer as 15, rather than 12 .

I would also like to get the count of each of these values,like M=7, N=2, O=2, P=1. I have written the below code.

test=  LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray); 
values = FOREACH test GENERATE A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;  
grp = GROUP values ALL;  
    A = FOREACH grp {
B =FILTER test.A=='M' OR test.B=='M' OR test.C=='M';
GENERATE group, COUNT(B);
};

I am getting an error "Scalar has more than one row in the output".


Solution

  • You are counting the column names as well in your final count.Modify the script to ignore the first row and then group by and count.

    test=  LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray); 
    
    ranked = rank test;
    test1 = Filter ranked by ($0 > 1); --Note:rank_test should work.
    
    values = FOREACH test1 GENERATE A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;  
    grp = GROUP values ALL;  
    counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);