Search code examples
pythonpandasstringconditional-statements

Pandas - Replace cell values using a conditional (normalising string input for gender)


Example data

id Gender Age
1 F 22
2 Fem 18
3 male 45
4 She/Her 30
5 Male 25
6 Non-bianary 26
7 M 18
8 female 20
9 Male 56

I want to be able to standardise this somewhat by replacing all cells with an 'F' in them with 'Female', and all cells with 'M' in them with 'Male'. I know the first step is to cast the whole column into capitals

df.Gender = df.Gender.str.capitalize()

and I know that I can do it value-by-value with

df['Gender'] = df['Gender'].replace(['F', 'Fem', 'Female'], 'Female')

but is there a way to do this somewhat programmatically?

such as

df.Gender = df.Gender.str.capitalise()

for i in df.Gender:
    if 'F' in str(i):
        #pd.replace call something like...
        df[df.Gender == i] = 'Female'
        #I know that line is very wrong
    elif 'M' in str(i)...


Solution

  • Try using regex:

    import re
    
    df["Gender"] = df["Gender"].str.replace(
        r"^F\S*$", "Female", flags=re.I, regex=True
    )
    print(df)
    

    Prints:

       id       Gender  Age
    0   1       Female   22
    1   2       Female   18
    2   3         male   45
    3   4      She/Her   30
    4   5         Male   25
    5   6  Non-bianary   26
    6   7            M   18
    7   8       Female   20
    8   9         Male   56