
Creating a new column using regex if certain keywords are found in another column's values


I have a DataFrame (df) with a string column called A:

 index   A
----------------------------
  0      boy_was_born_in_2010
  1      men_was_born_in_1997
  2      girl_this_is_2022
  3      this_is_a_lady
  4      how_tall_is_this_boy
  5      girl_is_studying

I want to check column A for specific keywords (boy, girl, men, lady) and create a new column B based on which keyword is found: if the string in A contains boy, B should be Kid; likewise men -> Male, girl -> Kid, and lady -> Female. The expected result is:

 index   A                         B
------------------------------------------
  0      boy_was_born_in_2010      Kid
  1      men_was_born_in_1997      Male
  2      girl_this_is_2022         Kid
  3      this_is_a_lady            Female
  4      how_tall_is_this_boy      Kid
  5      girl_is_studying          Kid

I used the following code:

df['B']=df.A.str.findall('boy').transform(''.join).replace('boy','Kid')

This works for a single keyword, but when I apply it again for the other keywords, the values set by the previous call get undone. I also don't think this is an efficient way to do it.

I also tried this:

df['B'] = df['A'].replace(to_replace ='^boy\W', value = 'Kid', regex = True) # not working

If you have any other way of doing this, please suggest it. I'm a complete beginner who has just started working on NLP projects.


Solution

  • Adding an example with multiple matches:

    import pandas as pd
    
    data = {'A': ['boy_was_born_in_2010',
      'men_was_born_in_1997',
      'girl_this_is_2022',
      'this_is_a_lady',
      'how_tall_is_this_boy',
      'girl_is_studying',
      'boy meets men']}
    
    df = pd.DataFrame(data)
    print(df)
    
                          A
    0  boy_was_born_in_2010
    1  men_was_born_in_1997
    2     girl_this_is_2022
    3        this_is_a_lady
    4  how_tall_is_this_boy
    5      girl_is_studying
    6         boy meets men
    

    Set up a dict for the mapping, use str.extract on a pattern with alternatives, squeeze the result, and finally apply map:

    a_dict = {'boy':'Kid',
              'men':'Male',
              'girl':'Kid',
              'lady':'Female'}
    
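    # Build a single capture group with all keywords: (boy|men|girl|lady)
    pattern = '(' + '|'.join(a_dict) + ')'
    # extract returns a one-column DataFrame; squeeze turns it into a Series
    df['B'] = df.A.str.extract(pattern).squeeze().map(a_dict)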
    
    print(df)
    
                          A       B
    0  boy_was_born_in_2010     Kid
    1  men_was_born_in_1997    Male
    2     girl_this_is_2022     Kid
    3        this_is_a_lady  Female
    4  how_tall_is_this_boy     Kid
    5      girl_is_studying     Kid
    6         boy meets men     Kid
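
One caveat with plain substring matching is that a keyword such as men would also be found inside a longer word such as women. The following is a minimal sketch of one way to guard against that, assuming a keyword should only count when it is not directly attached to other letters. The sample row women_are_here is made up for illustration. Because "_" counts as a word character in regex, a pattern like \bboy\b would not match boy_was, so the sketch uses letter-only lookarounds instead.

```python
import pandas as pd

a_dict = {'boy': 'Kid', 'men': 'Male', 'girl': 'Kid', 'lady': 'Female'}

# Assumption: a keyword only counts when no letter sits directly before or after it.
# The lookarounds are not capture groups, so the pattern still has exactly one group.
pattern = r'(?<![A-Za-z])(' + '|'.join(a_dict) + r')(?![A-Za-z])'

df = pd.DataFrame({'A': ['boy_was_born_in_2010',
                         'women_are_here',   # hypothetical row: "men" only inside "women"
                         'this_is_a_lady']})

# expand=False returns a Series directly, so squeeze is not needed here
df['B'] = df['A'].str.extract(pattern, expand=False).map(a_dict)
print(df)
```

Rows where no keyword is found as a separate token are left as NaN in B.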
    

    The method above only returns the first match, e.g. boy -> Kid for the last string. To get all matches, use extractall and join the mapped values for each row:

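    # extractall returns one row per match, indexed by (original row, match number)
    temp = df.A.str.extractall(pattern).squeeze().map(a_dict)
    # group on the original row index and join the mapped values
    df['B'] = temp.groupby(level=0).agg(', '.join)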
    
    print(df)
    
                          A          B
    0  boy_was_born_in_2010        Kid
    1  men_was_born_in_1997       Male
    2     girl_this_is_2022        Kid
    3        this_is_a_lady     Female
    4  how_tall_is_this_boy        Kid
    5      girl_is_studying        Kid
    6         boy meets men  Kid, Male
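
With the extractall approach, a row that contains no keyword ends up as NaN in B, and a row whose keywords map to the same label repeats it (boy and girl in one string would give "Kid, Kid"). If that is not wanted, a possible variant is sketched below. The helper name labels_for, the 'Unknown' placeholder, and the two extra sample rows are assumptions made for illustration, not part of the original question.

```python
import re
import pandas as pd

a_dict = {'boy': 'Kid', 'men': 'Male', 'girl': 'Kid', 'lady': 'Female'}
pattern = '|'.join(a_dict)  # no capture group needed for re.findall

def labels_for(text):
    # Hypothetical helper: map every keyword found in text to its label
    found = re.findall(pattern, text)
    # dict.fromkeys drops duplicate labels while keeping first-seen order
    labels = dict.fromkeys(a_dict[word] for word in found)
    # 'Unknown' is an assumed placeholder for rows without any keyword
    return ', '.join(labels) if labels else 'Unknown'

df = pd.DataFrame({'A': ['boy meets men', 'boy_and_girl', 'nothing_here']})
df['B'] = df['A'].apply(labels_for)
print(df)
```

This uses a plain Python function with apply rather than the vectorized string methods, which is usually fine for small frames and makes the per-row logic easier to adjust.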