Example data
id | Gender | Age |
---|---|---|
1 | F | 22 |
2 | Fem | 18 |
3 | male | 45 |
4 | She/Her | 30 |
5 | Male | 25 |
6 | Non-bianary | 26 |
7 | M | 18 |
8 | female | 20 |
9 | Male | 56 |
I want to be able to standardise this somewhat by replacing all cells with an 'F' in them with 'Female', and all cells with 'M' in them with 'Male'. I know the first step is to cast the whole column into capitals
df.Gender = df.Gender.str.capitalize()
and I know that I can do it value-by-value with
df['Gender'] = df['Gender'].replace(['F', 'Fem', 'Female'], 'Female')
but is there a way to do this somewhat programmatically?
such as
df.Gender = df.Gender.str.capitalise()
for i in df.Gender:
if 'F' in str(i):
#pd.replace call something like...
df[df.Gender == i] = 'Female'
#I know that line is very wrong
elif 'M' in str(i)...
Try using regex:
import re
df["Gender"] = df["Gender"].str.replace(
r"^F\S*$", "Female", flags=re.I, regex=True
)
print(df)
Prints:
id Gender Age
0 1 Female 22
1 2 Female 18
2 3 male 45
3 4 She/Her 30
4 5 Male 25
5 6 Non-bianary 26
6 7 M 18
7 8 Female 20
8 9 Male 56