Search code examples
apache-pig

Counting elements for each group using Pig


I'm trying to group and count the frequency of terms for each group in PigLatin, but I'm having some troubles to figure it out how to do it.

I have a collection of objects with the following schema:

{cluster_id: bytearray,terms: chararray}

And here is some samples

(10, smerter)
(10, graviditeten)
(10, smerter)
(10, smerter)
(10, udemærket)    
(20, eis feuer)
(20, herunterladen schau)
(20, download gratis)
(20, download gratis) 
(30, anschauen kinofilm)
(30, kauf rechnung)
(30, kauf rechnung)
(30, versandkostenfreie lieferung)
(30, kostenlose)
(30, kostenlose)
(30, kostenlose) 

the result I m trying to get is something like this

(10, smerter, 3)
(10, graviditeten, 2)
(10, udemærket, 1)
(20, download gratis, 2)
(20, eis feuer, 1)
(20, herunterladen schau, 1)    
(30, kostenlose, 3)
(30, kauf rechnung, 2)
(30, anschauen kinofilm, 1)
(30, versandkostenfreie lieferung, 1)

What would be the best way to do that? The following code groups by id and count the terms, but I wanted to count the terms for each group.

by_clusters = GROUP sample_data by cluster_id;
by_clusters_terms_count = FOREACH by_clusters GENERATE group as cluster_id, COUNT($1);

I make the grouping like this I end up with an object with the following schema

by_clusters: {group: bytearray,sample_data: {(cluster_id: bytearray,terms: chararray)}}

Now, I get to the point to actually count the terms inside the 'sample_data' tuple. I'm thinking about nested foreach, but I still didn't get it how could I apply it in this case. The code would be something like the following:

result = FOREACH by_clusters {

--count terms here, I don't know how

-- compiler gives me an error here
c = GROUP $1 BY terms; -- 
d = FOREACH c GENERATE COUNT(b), group;

GENERATE cluster_id, d;
}

Error I get:

ERROR 1200: Syntax error, unexpected symbol at or near '$1

Finally, I think I'm close, but I'm unable to solve it. I don't believe I'll have to write an UDF in this case.


Solution

  • I think that what you want to do is simply group by cluster_id and terms.

    You were very close to the result with you first try, just add terms to your group :

    by_clusters = GROUP sample_data by (cluster_id, terms);
    by_clusters_terms_count = FOREACH by_clusters GENERATE FLATTEN(group) as (cluster_id, terms), COUNT($1);
    

    I hope I understood well what you want!