Pig - Multiple Values For Key


I have written a Pig script that will perform some image processing via Python UDFs.

After some operations, I have something like this:

A = load 'data.txt' using PigStorage('|') as (name:chararray, pixelIntensity:float);

B = group A by pixelIntensity;

dump B;

B is then something like this:

(131.0,{(image1.jpg,131.0), (image2.jpg,131.0), (image3.jpg,131.0)})
(140.0,{(image5.jpg,140.0)})
(150.0,{(image4.jpg,150.0)})

If I were to run

dump A;

I'd get the following:

(image1.jpg,131.0)
(image2.jpg,131.0)
(image3.jpg,131.0)
(image4.jpg,150.0)
(image5.jpg,140.0)

So I've basically grouped them using their average pixel intensity as the key.

My question is this:

Can I extract only one element from each row in B? For example, I'd end up with something like:

(image1.jpg,131.0)
(image4.jpg,150.0)
...

Solution

  • A nested FOREACH with a LIMIT should do what you want:

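    -- Adjust the path and delimiter to your data (the question uses 'data.txt' and '|').
    A = LOAD 'data' USING PigStorage(',') AS (name:chararray, pixelIntensity:float);
    B = GROUP A BY pixelIntensity;
    C = FOREACH B {
        -- Keep a single tuple from each group's bag.
        D = LIMIT A 1;
        GENERATE FLATTEN(D);
    };
    STORE C INTO 'res';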
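
To make the semantics concrete, here is a rough plain-Python model of what the script above computes: group the rows by intensity and keep one row per group. It is only a sketch, not Pig code. The `rows` list and the `one_per_group` helper are made up for illustration, with the data copied from the `dump A` output in the question. One assumption worth flagging: Pig's LIMIT without a preceding ORDER makes no promise about which tuple it keeps, so this sketch sorts by name first to make the pick deterministic (in Pig, an ORDER inside the nested block before the LIMIT would play the same role).

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical in-memory stand-in for relation A (the rows shown by `dump A`).
rows = [
    ("image1.jpg", 131.0),
    ("image2.jpg", 131.0),
    ("image3.jpg", 131.0),
    ("image4.jpg", 150.0),
    ("image5.jpg", 140.0),
]


def one_per_group(records):
    """Return one (name, pixelIntensity) tuple per distinct intensity."""
    # Sort by intensity, then by name: groupby needs equal keys to be
    # adjacent, and the name ordering makes the kept tuple predictable
    # (an assumption of this sketch, not something Pig's LIMIT guarantees).
    ordered = sorted(records, key=itemgetter(1, 0))
    # Take only the first tuple of each group, like LIMIT 1 on the bag.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(1))]


result = one_per_group(rows)
print(result)
```

For the sample data this keeps image1.jpg for 131.0, image5.jpg for 140.0 and image4.jpg for 150.0.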