I am having some trouble using groupby to group a df based on the numeric value in a string. The regular expression (\w+)_\w+ should match the number in the string that I want to group on, but I am unsure how to implement this with groupby.
Any assistance would be appreciated.
Data:
import pandas as pd
df = pd.DataFrame({'x':['ab_c_1.0','ab_c_1.1','ab_c_12.0','ab_c_12.1','ab_c_123.0','ab_c_123.1']})
Desired groupings:
x
0 ab_c_1.0
1 ab_c_1.1

2 ab_c_12.0
3 ab_c_12.1

4 ab_c_123.0
5 ab_c_123.1
One of the options would be to extract the three parts, then ask for their ngroup:
grp = df["x"].str.extract(r"(\D+)_(\d+)\.(\d+)").groupby([0,1]).ngroup().add(1)
#[1, 1, 2, 2, 3, 3]
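To make the one-liner above easier to follow, here is a minimal step-by-step sketch of the same idea. The intermediate name parts is only illustrative and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'x': ['ab_c_1.0', 'ab_c_1.1', 'ab_c_12.0',
                         'ab_c_12.1', 'ab_c_123.0', 'ab_c_123.1']})

# Each capture group becomes a column (0, 1, 2) of a new DataFrame:
# prefix, integer part, fractional part.
parts = df["x"].str.extract(r"(\D+)_(\d+)\.(\d+)")

# Group on the prefix (column 0) and the integer part (column 1).
# ngroup() numbers the groups starting at 0, so add(1) shifts them to start at 1.
grp = parts.groupby([0, 1]).ngroup().add(1)

print(parts)
print(grp.tolist())
```

Note that the extracted values are strings, so with the default sorting the groups are numbered in lexicographic key order (for example, '12' would come before '2'). If that matters for your data, passing sort=False to groupby should number the groups in order of first appearance instead.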
Another variant with split, duplicated & cumsum:
grp = (~df["x"].str.split(".").str[0].duplicated()).cumsum()
#[1, 1, 2, 2, 3, 3]
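The same chain can be unpacked into its three steps, which may help when adapting it. This is only a sketch; the names key and first are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': ['ab_c_1.0', 'ab_c_1.1', 'ab_c_12.0',
                         'ab_c_12.1', 'ab_c_123.0', 'ab_c_123.1']})

# 1. Keep everything before the dot: 'ab_c_1', 'ab_c_1', 'ab_c_12', ...
key = df["x"].str.split(".").str[0]

# 2. True on the first occurrence of each key, False on repeats.
first = ~key.duplicated()

# 3. A running count of first occurrences gives the group number.
grp = first.cumsum()

print(grp.tolist())
```

This approach assumes rows with the same key sit next to each other, as they do in the sample data. If a key reappeared later in the frame, that row would receive the running count at that point rather than its original group number.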
If you need to assign the group numbers to a separate/new column, use df["group"] = grp.
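As a small sketch of that assignment, the group numbers can be stored and then used as a groupby key by column name. The name members and the use of agg(list) are illustrative choices, not something the question asked for.

```python
import pandas as pd

df = pd.DataFrame({'x': ['ab_c_1.0', 'ab_c_1.1', 'ab_c_12.0',
                         'ab_c_12.1', 'ab_c_123.0', 'ab_c_123.1']})

grp = (~df["x"].str.split(".").str[0].duplicated()).cumsum()

# Store the group numbers alongside the original values.
df["group"] = grp

# Group by the new column and collect the members of each group.
members = df.groupby("group")["x"].agg(list)
print(members)
```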
Output:
for _, g in df.groupby(grp):
    print(g)
x
0 ab_c_1.0
1 ab_c_1.1
x
2 ab_c_12.0
3 ab_c_12.1
x
4 ab_c_123.0
5 ab_c_123.1