Search code examples
pythonpandasdataframedata-manipulation

Aggregate a dataframe column based on a hierarichal condition from another column


I have an interesting problem and thought I will share it here for everyone. Let's assume we have a pandas DataFrame like this (dummy data):

Category Samples
A,B,123 6
A,B,456 3
A,B,789 1
X,Y,123 18
X,Y,456 7
X,Y,789 2
P,Q,123 1
P,Q,456 2
P,Q,789 2
L,M,123 1
L,M,456 3
S,T,123 5
S,T,456 5
S,T,789 3

The value in category are basically hierarchal in nature. Think of A as country, B as state, and 123 as zip-code. What I want is to greedily match for each category which has less than 5 samples and merge it with the nearest one. The final example DataFrame should be like:

Category Samples
A,B,123 10
X,Y,123 18
X,Y,456 9
P,Q,456 5
L,M,456 4
S,T,123 8
S,T,456 5

These are the possible rules I see that will be needed:

Case A,B : Sub-categories 456, 789 have less than 5 categories so we merge it but then the merged one will also have 4 which is less than 5 so it gets further merged with 123 and thus finally we get A,B,123 with 10.

Case X,Y : Subcategory 789 is the only one with less than 5 so it will merge with the category 456 (the one closest to 5 samples) to become X,Y,456 as 9 where X,Y,123 always had more than 5 so it remains as is.

Case P,Q : Here all the sub-categories have less then 5 but the idea is to merge it one at a time and it has nothing to do with the sequence. 123 here is having one sample, so it will merge with 789 to form a sample size of 3 which is still less than 5 so it will merge with 456 to form P,Q,456 with sample size of 5 but it can also be P,Q,789 as 5. Either is fine.

Case L,M : Only two sub-categories and both even merged will remain less than 5 but that's the best we can have so it should be L,M,456 as 4.

Case S,T : Only 789 has less than 5 so it can go with either 123 or 456 (as both have same samples), but not both. So the answer should be either S,T,123 as 5 and S,T,456 as 8 or S,T,123 as 8 and S,T,456 as 5.

What happens if there is a third column with values and based on the logic we want them to be merged too - add up if its an integer, and concatenate if that's a string - based on whatever condition we use on these columns?

I have been trying to break the column category then work with samples to add up but so far no luck. Any help is greatly appreciated.


Solution

  • Very tricky question especially with the structure of your data(because your grouper which is really the parts "A,B", "X,Y", etc. are not in a separate column. But I think you can do:

    df.sort_values(by='Samples', inplace=True, ignore_index=True)
    #grouper containing groupby keys ['A,B', 'X,Y', etc.)
    g = df['Category'].str.extract("(.*),+")[0] 
    #create a column to keep the category together
    df['sample_category'] = list(zip(df['Samples'], df['Category'])) 
    

    Then use functools.reduce to reduce the list by iteratively grabbing the next tuple if sample is less than 5:

    df2 = df.groupby(g, as_index=False).agg(
    {'sample_category': lambda s: 
     functools.reduce(lambda x, y: (x[0] + y[0], y[1]) if x[0] < 5 else (x, y), s)})
    

    Then do some munging to modify the elements to a list type:

    df2['sample_category'] = df2['sample_category'].apply(
    lambda x: [x] if isinstance(x[0], int) else list(x))
    

    Then explode, extract columns and finally drop the intermediate column 'sample_category'

    df2 = df2.explode('sample_category', ignore_index=True)
    df2['Sample'] = df2['sample_category'].str[0]
    df2['Category'] = df2['sample_category'].str[1]
    df2.drop('sample_category', axis=1, inplace=True)
    

    print(df2):

       Sample Category
    0      10  A,B,123
    1       4  L,M,456
    2       5  P,Q,789
    3       8  S,T,123
    4       5  S,T,456
    5       9  X,Y,456
    6      18  X,Y,123