Search code examples
apache-pighortonworks-data-platform

Calculate percentage on boolean column


Assuming my data has the following structure :

Year      | Location | New_client 

2018      | Paris    | true
2018      | Paris    | true
2018      | Paris    | false
2018      | London   | true
2018      | Madrid   | true
2018      | Madrid   | false
2017      | Paris    | true

I'm trying to calculate for each year and location the percentage of true value for New_client, so an example taking the records from the structure example would be

2018     | Paris    | 66
2018     | London   | 100
2018     | Madrid   | 50
2017     | Paris    | 100

Adapting from https://stackoverflow.com/a/13484279/2802552 my current script is but the difference is that instead of 1 column it's using 2 columns (Year and Location)

data = load...
grp = group inpt by Year; -- creates bags for each value in col1 (Year)
result = FOREACH grp {
    total = COUNT(data);
    t = FILTER data BY New_client == 'true'; --create a bag which contains only T values
    GENERATE FLATTEN(group) AS Year, total AS TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
};

The problem is this uses Year as reference while I need it to be Year AND District.

Thanks for your help.


Solution

  • You need to group by both Year and Location, which will require two modifications. First, add Location to the group by statement. Second, change FLATTEN(group) AS Year to FLATTEN(group) AS (Year, Location) since group is now a tuple with two fields.

    grp = group inpt by (Year, Location);
    result = FOREACH grp {
        total = COUNT(inpt);
        t = FILTER inpt BY New_client == 'true';
        GENERATE 
            FLATTEN(group) AS (Year, Location), 
            total AS TOTAL_ROWS_IN_INPUT_TABLE, 
            100*(double)COUNT(t)/(double)total AS PERCENTAGE_TRUE_IN_INPUT_TABLE;
    };