I have a DataFrame (df) with a column called A that holds strings:
index A
----------------------------
0 boy_was_born_in_2010
1 men_was_born_in_1997
2 girl_this_is_2022
3 this_is_a_lady
4 how_tall_is_this_boy
5 girl_is_studying
I am writing code that looks for specific keywords (boy, girl, men, lady) in column A and creates a new column B based on the keyword found: boy -> Kid, men -> Male, girl -> Kid, lady -> Female. The expected result is:
index A B
------------------------------------------
0 boy_was_born_in_2010 Kid
1 men_was_born_in_1997 Male
2 girl_this_is_2022 Kid
3 this_is_a_lady Female
4 how_tall_is_this_boy Kid
5 girl_is_studying Kid
I have used the following code:
df['B']=df.A.str.findall('boy').transform(''.join).replace('boy','Kid')
This works for a single keyword, but when I apply it again for another keyword, the values set by the previous call are undone. I also don't think this is an efficient approach.
I also tried this:
df['B'] = df['A'].replace(to_replace ='^boy\W', value = 'Kid', regex = True) # not working
If you know another way of doing this, please suggest it. I am a complete beginner and have just started working on NLP projects.
Adding an example with multiple matches:
import pandas as pd
data = {'A': ['boy_was_born_in_2010',
              'men_was_born_in_1997',
              'girl_this_is_2022',
              'this_is_a_lady',
              'how_tall_is_this_boy',
              'girl_is_studying',
              'boy meets men']}
df = pd.DataFrame(data)
print(df)
A
0 boy_was_born_in_2010
1 men_was_born_in_1997
2 girl_this_is_2022
3 this_is_a_lady
4 how_tall_is_this_boy
5 girl_is_studying
6 boy meets men
Set up a dict for the mapping, use str.extract on a pattern with alternatives, squeeze the result, and finally apply map:
a_dict = {'boy': 'Kid',
          'men': 'Male',
          'girl': 'Kid',
          'lady': 'Female'}
pattern = '(' + '|'.join(list(a_dict)) + ')'
df['B'] = df.A.str.extract(pattern).squeeze().map(a_dict)
print(df)
A B
0 boy_was_born_in_2010 Kid
1 men_was_born_in_1997 Male
2 girl_this_is_2022 Kid
3 this_is_a_lady Female
4 how_tall_is_this_boy Kid
5 girl_is_studying Kid
6 boy meets men Kid
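To make the chain above easier to follow, here is a minimal sketch that splits it into its intermediate steps. The three-row frame and the names extracted, as_series, labels and same are only for illustration, and re.escape is an extra precaution I added in case a keyword ever contains regex metacharacters. It also shows str.extract with expand=False, which I believe returns a Series directly and so avoids squeeze (squeeze would collapse a one-row result to a scalar).

```python
import re
import pandas as pd

# Small illustrative frame; the last row has no keyword on purpose
df = pd.DataFrame({'A': ['boy_was_born_in_2010',
                         'this_is_a_lady',
                         'no_keyword_here']})

a_dict = {'boy': 'Kid', 'men': 'Male', 'girl': 'Kid', 'lady': 'Female'}

# One capture group with all keywords as alternatives: (boy|men|girl|lady)
pattern = '(' + '|'.join(map(re.escape, a_dict)) + ')'

extracted = df['A'].str.extract(pattern)  # DataFrame with a single column 0
as_series = extracted.squeeze()           # that single column as a Series
labels = as_series.map(a_dict)            # keyword -> label, NaN if no match

# Same result without squeeze: expand=False yields a Series for one group
same = df['A'].str.extract(pattern, expand=False).map(a_dict)

print(labels)
```

Rows without any keyword end up as NaN in both variants, so you can fill them afterwards with fillna if you need a default label.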
The method above only returns the first match, e.g. boy -> Kid for the last string. If you want to get all matches, we can use something like this:
temp = df.A.str.extractall(pattern).squeeze().map(a_dict)
df['B'] = temp.groupby(level=0).agg(', '.join)
print(df)
A B
0 boy_was_born_in_2010 Kid
1 men_was_born_in_1997 Male
2 girl_this_is_2022 Kid
3 this_is_a_lady Female
4 how_tall_is_this_boy Kid
5 girl_is_studying Kid
6 boy meets men Kid, Male
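Two edge cases are not covered by the output above: a string whose keywords map to the same label (it would give "Kid, Kid"), and a string with no keyword at all (it would give NaN). The following is a rough sketch of one way to handle both, not the only one. The sample strings, the 'Unknown' default and the use of dict.fromkeys for order-preserving de-duplication are my own assumptions; I also select column 0 instead of calling squeeze, so a frame with a single match still yields a Series.

```python
import re
import pandas as pd

# Illustrative rows: two different labels, a repeated label, and no match
df = pd.DataFrame({'A': ['boy meets men',
                         'boy meets girl',
                         'nothing_relevant']})

a_dict = {'boy': 'Kid', 'men': 'Male', 'girl': 'Kid', 'lady': 'Female'}
pattern = '(' + '|'.join(map(re.escape, a_dict)) + ')'

# extractall returns one row per match, indexed by (original row, match no.)
matches = df['A'].str.extractall(pattern)[0].map(a_dict)

df['B'] = (matches.groupby(level=0)
                  # dict.fromkeys drops repeated labels but keeps their order
                  .agg(lambda s: ', '.join(dict.fromkeys(s)))
                  # bring back rows that had no match at all
                  .reindex(df.index)
                  # 'Unknown' is a hypothetical default label
                  .fillna('Unknown'))

print(df)
```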