python pandas dataframe data-manipulation

Aggregate a dataframe column based on a hierarichal condition from another column

I have an interesting problem and thought I will share it here for everyone. Let's assume we have a pandas DataFrame like this (dummy data):

Category	Samples
A,B,123	6
A,B,456	3
A,B,789	1
X,Y,123	18
X,Y,456	7
X,Y,789	2
P,Q,123	1
P,Q,456	2
P,Q,789	2
L,M,123	1
L,M,456	3
S,T,123	5
S,T,456	5
S,T,789	3

The value in category are basically hierarchal in nature. Think of A as country, B as state, and 123 as zip-code. What I want is to greedily match for each category which has less than 5 samples and merge it with the nearest one. The final example DataFrame should be like:

Category	Samples
A,B,123	10
X,Y,123	18
X,Y,456	9
P,Q,456	5
L,M,456	4
S,T,123	8
S,T,456	5

These are the possible rules I see that will be needed:

Case A,B : Sub-categories 456, 789 have less than 5 categories so we merge it but then the merged one will also have 4 which is less than 5 so it gets further merged with 123 and thus finally we get A,B,123 with 10.

Case X,Y : Subcategory 789 is the only one with less than 5 so it will merge with the category 456 (the one closest to 5 samples) to become X,Y,456 as 9 where X,Y,123 always had more than 5 so it remains as is.

Case P,Q : Here all the sub-categories have less then 5 but the idea is to merge it one at a time and it has nothing to do with the sequence. 123 here is having one sample, so it will merge with 789 to form a sample size of 3 which is still less than 5 so it will merge with 456 to form P,Q,456 with sample size of 5 but it can also be P,Q,789 as 5. Either is fine.

Case L,M : Only two sub-categories and both even merged will remain less than 5 but that's the best we can have so it should be L,M,456 as 4.

Case S,T : Only 789 has less than 5 so it can go with either 123 or 456 (as both have same samples), but not both. So the answer should be either S,T,123 as 5 and S,T,456 as 8 or S,T,123 as 8 and S,T,456 as 5.

What happens if there is a third column with values and based on the logic we want them to be merged too - add up if its an integer, and concatenate if that's a string - based on whatever condition we use on these columns?

I have been trying to break the column category then work with samples to add up but so far no luck. Any help is greatly appreciated.

Solution

Very tricky question especially with the structure of your data(because your grouper which is really the parts "A,B", "X,Y", etc. are not in a separate column. But I think you can do:

df.sort_values(by='Samples', inplace=True, ignore_index=True)
#grouper containing groupby keys ['A,B', 'X,Y', etc.)
g = df['Category'].str.extract("(.*),+")[0] 
#create a column to keep the category together
df['sample_category'] = list(zip(df['Samples'], df['Category']))

Then use functools.reduce to reduce the list by iteratively grabbing the next tuple if sample is less than 5:

df2 = df.groupby(g, as_index=False).agg(
{'sample_category': lambda s: 
 functools.reduce(lambda x, y: (x[0] + y[0], y[1]) if x[0] < 5 else (x, y), s)})

Then do some munging to modify the elements to a list type:

df2['sample_category'] = df2['sample_category'].apply(
lambda x: [x] if isinstance(x[0], int) else list(x))

Then explode, extract columns and finally drop the intermediate column 'sample_category'

df2 = df2.explode('sample_category', ignore_index=True)
df2['Sample'] = df2['sample_category'].str[0]
df2['Category'] = df2['sample_category'].str[1]
df2.drop('sample_category', axis=1, inplace=True)

print(df2):

   Sample Category
0      10  A,B,123
1       4  L,M,456
2       5  P,Q,789
3       8  S,T,123
4       5  S,T,456
5       9  X,Y,456
6      18  X,Y,123