Search code examples
pythonregexpandasregex-lookarounds

How to extract a word before another word in pandas


I've got this pandas DataFrame, which is a description of plays during a football match:

play_id type Text
1 pass Jon pass complete to Ben.
2 pass Clock 14:52, Jon pass complete to Mitch.
3 rush Frank rush.

My objective is to create a new column called "passer" with a script that will go through the description in the "text' column, and will take the name that is placed before the word 'pass'. So I first used this:

df['passer'] = df['Text'].str.extract(r'(.*?)pass', expand=False).str.strip()

Which gives me this:

play_id type Text passer
1 pass Jon pass complete to Ben. Jon
2 pass Clock 14:52, Jon pass complete to Mitch. Clock 14:52, Jon
3 rush Frank rush. NaN

It works correctly for the 1st and 3rd playid, but not for the second one, as it takes the clock, that can sometimes be included in the description.

I have tried to implement conditions on the creation of my column, where the code checks if 'Clock' is included in the description or not, and use the correct regex, but this does not work:

conditions = [
    (np.where(df.Text.str.contains('Clock', case=False))),
    (np.where(~df.Text.str.contains('Clock', case=False)))
    ]

choices = [
    df['Text'].str.extract(r', (.*?) pass', expand=False).str.strip(), 
    df['Text'].str.extract('(.*?) pass', expand=False).str.strip()
    ]

df['passerNEW'] = np.select(conditions, choices, default='NaN')
df

I get the following error:

TypeError: invalid entry 0 in condlist: should be boolean ndarray

Is there a way to make this function work? That seemed like a good way to do it, as in other cases I can have three different conditions to check in order to know which regex to use.


Solution

    • Use pandas.Series.str.extract with a positive lookahead conditional.
    • flags=re.IGNORECASE is used to ignore the case of 'pass'
      • df.Text.str.lower().str.extract('(\w+(?=\s+pass))') can be used instead of importing re for the flag.
    import pandas as pd
    import re
    
    # test dataframe
    data = {'play_id': ['1', '2', '3'], 'type': ['pass', 'pass', 'rush'], 'Text': ['Jon PASS complete to Ben.', 'Clock 14:52, Jon pass complete to Mitch.', 'Frank rush.']}
    df = pd.DataFrame(data)
    
    # display(df)
    play_id type                                     Text
          1 pass                Jon PASS complete to Ben.
          2 pass Clock 14:52, Jon pass complete to Mitch.
          3 rush                              Frank rush.
    
    # extract
    df['passer'] = df.Text.str.extract('(\w+(?=\s+pass))', flags=re.IGNORECASE)
    
    # display(df)
    play_id type                                     Text passer
          1 pass                Jon PASS complete to Ben.    Jon
          2 pass Clock 14:52, Jon pass complete to Mitch.    Jon
          3 rush                              Frank rush.    NaN