python pandas regex group-by

How do you group a dataframe based on a column with string values?


I am having trouble using groupby to group a DataFrame by the numeric value embedded in a string column. I expected the regular expression (\w+)_\w+ to capture the digits that define each group, but I am unsure how to use it with groupby.

Any assistance would be appreciated.

Data:

import pandas as pd

df = pd.DataFrame({'x':['ab_c_1.0','ab_c_1.1','ab_c_12.0','ab_c_12.1','ab_c_123.0','ab_c_123.1']})

Desired groupings:

             x
0     ab_c_1.0
1     ab_c_1.1

2    ab_c_12.0
3    ab_c_12.1

4   ab_c_123.0
5   ab_c_123.1

Solution

  • One option is to extract the three parts (prefix, integer part, decimal part) with a regular expression, group on the first two, and number the groups with ngroup:

    # Columns 0, 1 and 2 hold the prefix, the integer part and the decimal part.
    grp = df["x"].str.extract(r"(\D+)_(\d+)\.(\d+)").groupby([0, 1]).ngroup().add(1)

    # grp.tolist() -> [1, 1, 2, 2, 3, 3]
    

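    Another variant uses split, duplicated and cumsum. It starts a new group at each first occurrence of the text before the dot, so it assumes that rows belonging to the same group are adjacent, as they are here: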

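    # Flag the first occurrence of each prefix (the text before the dot), then count the flags cumulatively.
    grp = (~df["x"].str.split(".").str[0].duplicated()).cumsum()

    # grp.tolist() -> [1, 1, 2, 2, 3, 3]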
    

    To store the group numbers in a new column, assign them with df["group"] = grp.

    Iterating over the groups:

    for _, g in df.groupby(grp):
        print(g)

    Output:

              x
    0  ab_c_1.0
    1  ab_c_1.1
               x
    2  ab_c_12.0
    3  ab_c_12.1
                x
    4  ab_c_123.0
    5  ab_c_123.1
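
As a complement to the answer above, here is a minimal sketch of grouping directly on the regex-extracted key, without building a group number first. The shuffled sample data, the pattern `_(\d+)\.\d+$` and the names `key`, `groups` and `naive` are assumptions made for illustration and are not part of the original answer. The rows are deliberately out of order to show that this approach does not depend on adjacent rows, whereas the duplicated/cumsum variant does.

```python
import pandas as pd

# Hypothetical, shuffled variant of the question's data: rows that share
# an integer part are no longer adjacent.
df = pd.DataFrame({"x": ["ab_c_12.0", "ab_c_1.0", "ab_c_12.1",
                         "ab_c_123.0", "ab_c_1.1", "ab_c_123.1"]})

# Capture the integer part between the last underscore and the dot.
# expand=False returns a Series instead of a one-column DataFrame.
# Casting to int makes the groups sort numerically rather than as strings.
key = df["x"].str.extract(r"_(\d+)\.\d+$", expand=False).astype(int)

# Group directly on the extracted key and collect the members of each group.
groups = {int(k): g["x"].tolist() for k, g in df.groupby(key)}

# Optional: number the groups (1-based, in sorted key order) as a new column.
df["group"] = df.groupby(key).ngroup().add(1)

# For comparison: the duplicated/cumsum variant applied to the same shuffled
# data gives the second "ab_c_12" row the number of the preceding "ab_c_1" row.
naive = (~df["x"].str.split(".").str[0].duplicated()).cumsum()

print(groups)
print(df)
print(naive.tolist())
```

If the rows are already sorted so that each group is contiguous, as in the question, both approaches from the answer and this sketch produce the same grouping.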