I have a messy string variable containing stage information, I want to create a cleaner string with fewer groups. The current dataframe looks like this:
cohort = pd.DataFrame({'stage_group': ['XXX Stage I', 'Stage II XXX', 'Stage III XXX', 'XX Stage IV XXX', 'NA']},index=[1, 2, 3, 4, 5])
My ideal variable is 3 levels: Stage I-III, Stage IV and unknown:
cohort2 = pd.DataFrame({'stage_group': ['XXX Stage I', 'Stage II XXX', 'Stage III XXX', 'XX Stage IV XXX','NA'],'stage': ['Stage I', 'Stage II', 'Stage III', 'Stage IV', 'Unknown']},index=[1, 2, 3, 4, 5])
I tried the following codes but they did not assign group correctly (I just got Stage I-III and unknown). Any suggestions would be helpful.
searchfor = ['Stage I', 'Stage II', 'Stage III']
cohort['stage'] = pd.np.where(cohort.stage_group.str.contains('|'.join(searchfor)), "Stage I-III",
pd.np.where(cohort.stage_group.str.contains('Stage IV'), "Stage IV", "Unkown"))
Code works for me if I change order because Stage IV
contains also Stage I
so Stage IV
has to be checked before Stage I
import pandas as pd
data = {'stage_group': '''XXX Stage I
Stage II XXX
Stage III XXX
XX Stage IV XXX
NA'''.split('\n')
}
cohort = pd.DataFrame(data)
print(cohort)
searchfor = ['Stage I', 'Stage II', 'Stage III']
cohort['stage'] = pd.np.where(cohort.stage_group.str.contains('Stage IV'), "Stage IV",
pd.np.where( cohort.stage_group.str.contains('|'.join(searchfor)), "Stage I-III", "Unkown"))
print(cohort)
Result
stage_group
0 XXX Stage I
1 Stage II XXX
2 Stage III XXX
3 XX Stage IV XXX
4 NA
stage_group stage
0 XXX Stage I Stage I-III
1 Stage II XXX Stage I-III
2 Stage III XXX Stage I-III
3 XX Stage IV XXX Stage IV
4 NA Unkown