I want to write a regular expression for str.extract on a Pandas DataFrame column that extracts the first match of the text found between a START word and one of two possible STOP phrases.
Example 1: START hello there STOP WORD
Example 2: START Good morning ANOTHER DELIMITER
In the first case, I want to return 'hello there', and in the second case 'Good morning'.
If there is only one stop phrase at the end, as in example 1, the following regular expression works with str.extract. But how do I allow for two different stop phrases?
r'(?s)START(.*?)STOP\s+WORD'
Use a regex alternation inside a non-capturing group, so that either stop phrase can end the match:
\bSTART\s+(.*?)\s+(?:STOP WORD|ANOTHER DELIMITER)\b
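To see how the pieces fit together, here is a minimal sketch that checks the pattern against the two example strings with Python's built-in re module. The names PATTERN, samples, and extracted are made up for illustration. The lazy (.*?) stops at the first stop phrase it can reach, and (?:...) groups the two alternatives without creating a second capture group.

```python
import re

# Same pattern as above; (?:...) groups the alternatives without capturing them.
PATTERN = re.compile(r'\bSTART\s+(.*?)\s+(?:STOP WORD|ANOTHER DELIMITER)\b')

# The two example strings from the question.
samples = [
    "START hello there STOP WORD",
    "START Good morning ANOTHER DELIMITER",
]

extracted = []
for text in samples:
    m = PATTERN.search(text)
    # group(1) is the text between START and whichever stop phrase matched.
    extracted.append(m.group(1) if m else None)

print(extracted)
```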
Pandas code:
df["match"] = df["col"].str.extract(r'\bSTART\s+(.*?)\s+(?:STOP WORD|ANOTHER DELIMITER)\b')
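As a quick check, the following sketch builds a small DataFrame from the two examples in the question plus one row without any markers. The column name "col" is taken from the line above; the third row and the use of expand=False (which makes str.extract return a Series rather than a one-column DataFrame) are additions for illustration.

```python
import pandas as pd

# Sample data: the two examples from the question and one non-matching row.
df = pd.DataFrame({"col": [
    "START hello there STOP WORD",
    "START Good morning ANOTHER DELIMITER",
    "no markers here",
]})

pattern = r'\bSTART\s+(.*?)\s+(?:STOP WORD|ANOTHER DELIMITER)\b'

# expand=False returns a Series for a single capture group;
# rows where the pattern does not match become NaN.
df["match"] = df["col"].str.extract(pattern, expand=False)

print(df)
```

If the text between START and the stop phrase can span several lines, as the (?s) in the original attempt suggests, the same effect should be available by adding (?s) to the start of the pattern or by passing flags=re.DOTALL to str.extract.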