How do you group a dataframe based on a column with string values?

I am having trouble using groupby to group a df by the numeric value embedded in a string. I expected the regular expression (\w+)_\w+ to match the digits I want to group on, but I am unsure how to apply it with groupby.

Any assistance would be appreciated.

Data:

import pandas as pd

df = pd.DataFrame({'x':['ab_c_1.0','ab_c_1.1','ab_c_12.0','ab_c_12.1','ab_c_123.0','ab_c_123.1']})

Desired groupings:

             x
0     ab_c_1.0
1     ab_c_1.1

2    ab_c_12.0
3    ab_c_12.1

4   ab_c_123.0
5   ab_c_123.1

Solution

One option is to extract the three parts of each string (prefix, integer part, decimal part), then number the groups formed by the first two with ngroup:

grp = df["x"].str.extract(r"(\D+)_(\d+)\.(\d+)").groupby([0,1]).ngroup().add(1)

#[1, 1, 2, 2, 3, 3]
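To make the one-liner above easier to follow, here is a sketch that splits it into its two steps. The intermediate name `parts` is only illustrative; everything else comes from the snippet above.

```python
import pandas as pd

df = pd.DataFrame({'x': ['ab_c_1.0', 'ab_c_1.1', 'ab_c_12.0',
                         'ab_c_12.1', 'ab_c_123.0', 'ab_c_123.1']})

# Step 1: split each string into prefix, integer part and decimal part.
# The result is a DataFrame whose columns are labelled 0, 1 and 2.
parts = df["x"].str.extract(r"(\D+)_(\d+)\.(\d+)")  # `parts` is an illustrative name
print(parts)

# Step 2: group on (prefix, integer part) and number the groups.
# ngroup() starts at 0, so add(1) shifts the labels to start at 1.
grp = parts.groupby([0, 1]).ngroup().add(1)
print(grp.tolist())
```

Note that the extracted pieces are strings, so with the default sorting the groups are numbered in lexicographic key order ("10" would come before "2"). That makes no difference for the sample data, but it may matter for other inputs.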

Another variant uses split, duplicated and cumsum:

grp = (~df["x"].str.split(".").str[0].duplicated()).cumsum()

#[1, 1, 2, 2, 3, 3]
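The same chain can be unrolled step by step, which shows why the cumulative sum produces group numbers. The names `key` and `is_first` are illustrative and not part of the snippet above.

```python
import pandas as pd

df = pd.DataFrame({'x': ['ab_c_1.0', 'ab_c_1.1', 'ab_c_12.0',
                         'ab_c_12.1', 'ab_c_123.0', 'ab_c_123.1']})

# Everything before the first "." is the grouping key, e.g. "ab_c_12".
key = df["x"].str.split(".").str[0]

# True on the first occurrence of each key, False on repeats.
is_first = ~key.duplicated()

# Each True bumps the running total by one, so repeats inherit
# the number of the row that opened their group.
grp = is_first.cumsum()
print(grp.tolist())
```

This relies on rows with the same key sitting next to each other, as they do in the sample data. A key that reappears after a different key would pick up the most recent counter value rather than its own group number.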

If you need to assign the group numbers to a separate/new column, use df["group"] = grp.

Printing each group:

for _, g in df.groupby(grp):
    print(g)

Output:

          x
0  ab_c_1.0
1  ab_c_1.1
           x
2  ab_c_12.0
3  ab_c_12.1
            x
4  ab_c_123.0
5  ab_c_123.1
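Since groupby accepts a Series aligned with the frame (as it does with grp above), the extracted number could also be used directly as the key, without converting it to 1, 2, 3 labels first. This is a different variant from the two above and only a sketch: the pattern `_(\d+)\.` and the names `key` and `groups` are assumptions based on the sample data, where the number of interest sits between an underscore and the dot.

```python
import pandas as pd

df = pd.DataFrame({'x': ['ab_c_1.0', 'ab_c_1.1', 'ab_c_12.0',
                         'ab_c_12.1', 'ab_c_123.0', 'ab_c_123.1']})

# Assumed pattern: the digits between an underscore and the dot.
# expand=False returns a Series because there is a single capture group.
key = df["x"].str.extract(r"_(\d+)\.", expand=False).astype(int)

# Group by the integer key; collect each group's strings for inspection.
groups = {int(k): g["x"].tolist() for k, g in df.groupby(key)}
print(groups)
```

Converting the key to int means the groups are ordered numerically rather than as strings.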