Tags: python, pandas, lambda, group-by, grouping

Assign a group number to each row, based on column value ranges


I have some data that needs to be clustered into groups, based on a few predefined conditions.

Suppose we have the following table:

import pandas as pd

d = {'ID': [100, 101, 102, 103, 104, 105],
     'col_1': [12, 3, 7, 13, 19, 25],
     'col_2': [3, 1, 3, 3, 2, 4]
    }


df = pd.DataFrame(data=d)
df.head()

    ID  col_1  col_2
0  100     12      3
1  101      3      1
2  102      7      3
3  103     13      3
4  104     19      2

Here, I want to group the IDs based on the following ranges (conditions) on col_1 and col_2.

For col_1 I divide the values into the following groups: [0, 10], [11, 15], [16, 20], [20, +inf].

For col_2 I just use the df['col_2'].unique() values: [1], [2], [3], [4].
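For reference, the same conditions written down as data rather than code (just a sketch; it assumes the boundaries are treated as left-closed, which is how the solution below bins them):

col_1_bins = [0, 10, 15, 20, float('inf')]   # [0, 10), [10, 15), [15, 20), [20, inf)
col_2_groups = [1, 2, 3, 4]                  # df['col_2'].unique()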

The desired grouping is in group_num column:

    ID  col_1  col_2  group_num
0  100     12      3          3
1  101      3      1          1
2  102      7      3          2
3  103     13      3          3
4  104     19      2          4
5  105     25      4          5

Notice that rows 0 and 3 get the same group number, and note the order in which the group numbers are assigned.

So far I have only come up with an if-elif function that pre-defines all the groups. That is not a workable solution, because in my real task there are far more ranges and conditions.

My code snippet, if it's relevant:

# This logic does not work, because here I have to predefine all the group configurations (i.e. the numbers),
# but I want to create the groups "dynamically":
# the first group is created, and if the next row does not fit into it -> create a new one

def grouping(val_1, val_2):
    
    # not using match-case here, because my Python < 3.10
    if ((val_1 >= 0) and (val_1 <10)) and (val_2 == 1):
        return 1
    elif ((val_1 >= 0) and (val_1 <10)) and (val_2 == 2):
        return 2
    elif ...
    
    ...

df['group_num'] = df.apply(lambda x: grouping(x.col_1, x.col_2), axis=1)

Solution

  • Make a dataframe for checking the groups

    bins = [0, 10, 15, 20, float('inf')]
    df1 = (df[['col_1', 'col_2']]
           .assign(col_1=pd.cut(df['col_1'], bins=bins, right=False))
           .sort_values(['col_1', 'col_2']))
    

    df1

        col_1           col_2
    1   [0.0, 10.0)     1
    2   [0.0, 10.0)     3
    0   [10.0, 15.0)    3
    3   [10.0, 15.0)    3
    4   [15.0, 20.0)    2
    5   [20.0, inf)     4
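
    Note (not part of the original answer): pd.cut with right=False makes every interval left-closed, [low, high), so a boundary value such as 10 falls into [10.0, 15.0) rather than [0.0, 10.0). A minimal standalone sketch of just that binning step:

    import pandas as pd

    bins = [0, 10, 15, 20, float('inf')]
    values = pd.Series([12, 3, 7, 13, 19, 25])

    # each value is mapped to the left-closed interval that contains it
    binned = pd.cut(values, bins=bins, right=False)
    print(binned.astype(str).tolist())
    # ['[10.0, 15.0)', '[0.0, 10.0)', '[0.0, 10.0)', '[10.0, 15.0)', '[15.0, 20.0)', '[20.0, inf)']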
    



  • Check for group changes in df1: flag every row that differs from the previous one and take the cumulative sum

    df1.ne(df1.shift(1)).any(axis=1).cumsum()
    

    output:

    1    1
    2    2
    0    3
    3    3
    4    4
    5    5
    dtype: int32
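
    The idiom used here is a general one: compare each row with the previous one via shift(1), flag rows where any column changed with ne(...).any(axis=1), and take cumsum() so that every run of identical consecutive rows gets the same number. A tiny self-contained sketch of the same idea:

    import pandas as pd

    toy = pd.DataFrame({'a': ['x', 'x', 'y', 'y', 'x'],
                        'b': [1, 1, 1, 2, 2]})

    # True wherever a row differs from the one above it (the first row is
    # always True because shift introduces NaN); cumsum turns those flags
    # into consecutive block numbers
    block_id = toy.ne(toy.shift(1)).any(axis=1).cumsum()
    print(block_id.tolist())   # [1, 1, 2, 3, 4]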
    



  • Assign the result to a group_num column

    df.assign(group_num=df1.ne(df1.shift(1)).any(axis=1).cumsum())
    

    result:

        ID  col_1   col_2   group_num
    0   100 12      3       3
    1   101 3       1       1
    2   102 7       3       2
    3   103 13      3       3
    4   104 19      2       4
    5   105 25      4       5
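
A more compact variant (my own sketch, not part of the answer above) groups by the binned col_1 and col_2 directly and lets ngroup() hand out the numbers in sorted key order; observed=True keeps unused bin/value combinations out of the numbering. On this data it reproduces the same group_num column:

    import pandas as pd

    d = {'ID': [100, 101, 102, 103, 104, 105],
         'col_1': [12, 3, 7, 13, 19, 25],
         'col_2': [3, 1, 3, 3, 2, 4]}
    df = pd.DataFrame(data=d)

    bins = [0, 10, 15, 20, float('inf')]
    # group by (col_1 bin, col_2); ngroup() numbers the groups 0..n-1 in
    # sorted key order, so +1 matches the 1-based numbering above
    df['group_num'] = (df.groupby([pd.cut(df['col_1'], bins=bins, right=False), 'col_2'],
                                  observed=True)
                         .ngroup() + 1)
    print(df)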