python pandas python-re

Pandas DataFrame extract between one START word and multiple STOP words


I want to use str.extract on a Pandas DataFrame column with a regular expression that captures the first match of the text between a START word and either of two possible STOP words.

Example 1: START hello there STOP WORD

Example 2: START Good morning ANOTHER DELIMITER

In the first case, I want to return 'hello there', and in the second case, 'Good morning'.

If there is only one stop word at the end, as in example 1, the following regular expression works with str.extract. But how do I make it accept either of the two STOP words?

r'(?s)START(.*?)STOP\s+WORD'


Solution

  • Use alternation inside a non-capturing group, (?:...|...), so that either stop phrase ends the match:

    \bSTART\s+(.*?)\s+(?:STOP WORD|ANOTHER DELIMITER)\b
    

    Pandas code:

    df["match"] = df["col"].str.extract(r'\bSTART\s+(.*?)\s+(?:STOP WORD|ANOTHER DELIMITER)\b')
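
    To try this out, here is a minimal, self-contained sketch. The sample rows and the third "no match" row are made up for illustration, and the column name "col" simply follows the snippet above. Two things differ from the one-liner and are my own assumptions about what you may want: the (?s) flag from your original pattern is kept so that "." also matches newlines, and \s+ is used inside each stop phrase to tolerate variable whitespace. Passing expand=False makes str.extract return a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data; "col" is the column name used in the answer.
df = pd.DataFrame({
    "col": [
        "START hello there STOP WORD",
        "START Good morning ANOTHER DELIMITER",
        "no delimiters in this row",
    ]
})

# (?s): "." also matches newlines, as in the question's original pattern.
# (.*?) is lazy, so it stops at the first stop phrase that follows.
# (?:...|...) is a non-capturing alternation over the two stop phrases.
pattern = r"(?s)\bSTART\s+(.*?)\s+(?:STOP\s+WORD|ANOTHER\s+DELIMITER)\b"

# With a single capture group, expand=False returns a Series.
df["match"] = df["col"].str.extract(pattern, expand=False)

print(df["match"].tolist())
```

    The first two rows should give 'hello there' and 'Good morning'; rows where the pattern does not match get a missing value (NaN).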