I've got this pandas DataFrame, which is a description of plays during a football match:
play_id | type | Text |
---|---|---|
1 | pass | Jon pass complete to Ben. |
2 | pass | Clock 14:52, Jon pass complete to Mitch. |
3 | rush | Frank rush. |
My objective is to create a new column called "passer" with a script that will go through the description in the "text' column, and will take the name that is placed before the word 'pass'. So I first used this:
df['passer'] = df['Text'].str.extract(r'(.*?)pass', expand=False).str.strip()
Which gives me this:
play_id | type | Text | passer |
---|---|---|---|
1 | pass | Jon pass complete to Ben. | Jon |
2 | pass | Clock 14:52, Jon pass complete to Mitch. | Clock 14:52, Jon |
3 | rush | Frank rush. | NaN |
It works correctly for the 1st and 3rd playid, but not for the second one, as it takes the clock, that can sometimes be included in the description.
I have tried to implement conditions on the creation of my column, where the code checks if 'Clock' is included in the description or not, and use the correct regex, but this does not work:
conditions = [
(np.where(df.Text.str.contains('Clock', case=False))),
(np.where(~df.Text.str.contains('Clock', case=False)))
]
choices = [
df['Text'].str.extract(r', (.*?) pass', expand=False).str.strip(),
df['Text'].str.extract('(.*?) pass', expand=False).str.strip()
]
df['passerNEW'] = np.select(conditions, choices, default='NaN')
df
I get the following error:
TypeError: invalid entry 0 in condlist: should be boolean ndarray
Is there a way to make this function work? That seemed like a good way to do it, as in other cases I can have three different conditions to check in order to know which regex to use.
pandas.Series.str.extract
with a positive lookahead conditional.
lookahead
.\s+pass
will work for any amount of whitespace between the name and 'pass'
(e.g. 'Jon PaSs'
)flags=re.IGNORECASE
is used to ignore the case of 'pass'
df.Text.str.lower().str.extract('(\w+(?=\s+pass))')
can be used instead of importing re
for the flag.import pandas as pd
import re
# test dataframe
data = {'play_id': ['1', '2', '3'], 'type': ['pass', 'pass', 'rush'], 'Text': ['Jon PASS complete to Ben.', 'Clock 14:52, Jon pass complete to Mitch.', 'Frank rush.']}
df = pd.DataFrame(data)
# display(df)
play_id type Text
1 pass Jon PASS complete to Ben.
2 pass Clock 14:52, Jon pass complete to Mitch.
3 rush Frank rush.
# extract
df['passer'] = df.Text.str.extract('(\w+(?=\s+pass))', flags=re.IGNORECASE)
# display(df)
play_id type Text passer
1 pass Jon PASS complete to Ben. Jon
2 pass Clock 14:52, Jon pass complete to Mitch. Jon
3 rush Frank rush. NaN