Pig: Group by ranges / binning data


I have a set of integer values that I would like to group into a bunch of bins.

Example: Say I have a thousand points between 1 and 1000, and I want to do 20 bins.

Is there any way to group them into bins/arrays?

Also, I will not know ahead of time how wide the range will be, so I can't hardcode any specific values.


Solution

  • If you know the min and max, you can divide the range evenly among the bins. For example:

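    -- foo.pig
    -- Load one integer id per record.
    ids = load '$INPUT' as (id: int);
    -- Integer arithmetic: multiply before dividing so bin_id falls in 0 .. BIN_COUNT - 1.
    ids_with_key = foreach ids generate (id - $MIN) * $BIN_COUNT / ($MAX - $MIN + 1) as bin_id, id;
    -- Collect the ids that share a bin, then emit one (bin, id) pair per row.
    group_by_id = group ids_with_key by bin_id;
    bin_id = foreach group_by_id generate group, flatten(ids_with_key.id);
    dump bin_id;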
    

    Then you can use the following command to run it:

    pig -f foo.pig -p MIN=1 -p MAX=1000 -p BIN_COUNT=20 -p INPUT=your_input_path
    

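    The idea behind the script is to split the range [MIN, MAX] into BIN_COUNT bins of equal size, BIN_SIZE = (MAX - MIN + 1) / BIN_COUNT, then map each id to its bin number, (id - MIN) / BIN_SIZE, and group by that number. Bin numbers start at 0. The script multiplies by BIN_COUNT before dividing; this is the same formula rearranged so that integer division does not truncate BIN_SIZE first.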
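To check the arithmetic outside Pig, the mapping can be sketched in a few lines of Python. This is only an illustration of the formula above, not part of the Pig script: the function names `bin_index` and `bin_values` are made up for this example, and taking the min and max from the data itself is an assumption added here to cover the case where the range is not known ahead of time.

```python
def bin_index(value: int, lo: int, hi: int, bin_count: int) -> int:
    """Map a value in [lo, hi] to a bin number in [0, bin_count - 1]."""
    # Multiply before dividing, as the Pig expression does, so that integer
    # division does not truncate the bin size first. For value >= lo the
    # operands are non-negative, so Python's floor division gives the same
    # result as truncating integer division.
    return (value - lo) * bin_count // (hi - lo + 1)


def bin_values(values, bin_count):
    """Group values by bin number, taking the range from the data itself."""
    lo, hi = min(values), max(values)  # assumption: range derived from the input
    bins = {}
    for v in values:
        bins.setdefault(bin_index(v, lo, hi, bin_count), []).append(v)
    return bins


if __name__ == "__main__":
    # The example from the question: 1..1000 split into 20 bins.
    bins = bin_values(list(range(1, 1001)), 20)
    for key in sorted(bins):
        print(key, bins[key][0], bins[key][-1], len(bins[key]))
```

With the question's example (the integers 1 to 1000 and 20 bins), bin 0 holds 1 to 50, bin 1 holds 51 to 100, and so on up to bin 19, which holds 951 to 1000.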