Search code examples
pythonpandascontainsdata-manipulation

python create string variable based on a another string variable (contains)


I have a messy string variable containing stage information, I want to create a cleaner string with fewer groups. The current dataframe looks like this:

cohort = pd.DataFrame({'stage_group': ['XXX Stage I', 'Stage II XXX', 'Stage III XXX', 'XX Stage IV XXX', 'NA']},index=[1, 2, 3, 4, 5])

My ideal variable is 3 levels: Stage I-III, Stage IV and unknown:

cohort2 = pd.DataFrame({'stage_group': ['XXX Stage I', 'Stage II XXX', 'Stage III XXX', 'XX Stage IV XXX','NA'],'stage': ['Stage I', 'Stage II', 'Stage III', 'Stage IV', 'Unknown']},index=[1, 2, 3, 4, 5])

I tried the following codes but they did not assign group correctly (I just got Stage I-III and unknown). Any suggestions would be helpful.

searchfor = ['Stage I', 'Stage II', 'Stage III']

cohort['stage'] = pd.np.where(cohort.stage_group.str.contains('|'.join(searchfor)), "Stage I-III",
                  pd.np.where(cohort.stage_group.str.contains('Stage IV'), "Stage IV", "Unkown"))

Solution

  • Code works for me if I change order because Stage IV contains also Stage I so Stage IV has to be checked before Stage I

    import pandas as pd
    
    data = {'stage_group': '''XXX Stage I 
    Stage II XXX
    Stage III XXX
    XX Stage IV XXX
    NA'''.split('\n')
    }
    
    cohort = pd.DataFrame(data)
    
    print(cohort)
    
    searchfor = ['Stage I', 'Stage II', 'Stage III']
    
    cohort['stage'] = pd.np.where(cohort.stage_group.str.contains('Stage IV'), "Stage IV",
                      pd.np.where( cohort.stage_group.str.contains('|'.join(searchfor)), "Stage I-III", "Unkown"))
    
    print(cohort)
    

    Result

           stage_group
    0     XXX Stage I 
    1     Stage II XXX
    2    Stage III XXX
    3  XX Stage IV XXX
    4               NA
    
           stage_group        stage
    0     XXX Stage I   Stage I-III
    1     Stage II XXX  Stage I-III
    2    Stage III XXX  Stage I-III
    3  XX Stage IV XXX     Stage IV
    4               NA       Unkown