python regex split python-re text-extraction

Regex to get the word preceding a phrase in Python


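I need to extract the word "local" when it comes directly before "gun store". However, the function below does not return it, because split() breaks the text into single words, so the two-word phrase "gun store" is never seen as one token. Is there any way to get around this?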

Source looks like this: As reported on 30 December 2019, in Maipu, Metropolitan region, a group of at least 10 rioters attempted to loot a local gun store.

Here is the function:

    regex_filter = r'local|dozen|several|looted'
    property_key = r"\b(gun store|establishments|supermarket)\b"
    source= source.split()
    for i, w in enumerate(source):
        if (re.search(property_key, w)):
            if re.match(re.compile(regex_filter, flags=re.IGNORECASE), source[i-1]):
                return source[i-1]

Solution

  • I suggest extracting the word preceding any of the words listed in property_key with

    re.search(r"(\S+)\s+(?:gun store|establishments|supermarket)\b", text)
    

    Or, if the word consists only of letters and digits and any whitespace or punctuation may separate it from the phrase:

    re.search(r"([^\W_]+)[\W_]+(?:gun store|establishments|supermarket)\b", text)
    

    In the first pattern, (\S+)\s+ captures one or more non-whitespace chars into Group 1 and then matches one or more whitespace chars. In the second pattern, ([^\W_]+)[\W_]+ captures one or more letters or digits into Group 1 and then matches one or more non-word or underscore chars. Note that \S+ also captures any punctuation attached to the word, while [^\W_]+ does not.

    A complete Python example:

    import re
    rx = r"(\S+)\s+(?:gun store|establishments|supermarket)\b"
    text = "As reported on 30 December 2019, in Maipu, Metropolitan region, a group of at least 10 rioters attempted to loot a local gun store."
    m = re.search(rx, text)
    if m:
        print(m.group(1))
    
    # => local
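
The question's function also checks the preceding word against regex_filter, which the patterns above do not do. A minimal sketch of one way to keep that check is to place the filter words in the capturing group itself. The helper name preceding_filter_word and the constant names are hypothetical, and the sketch assumes the filter words should match as whole words (the original re.match call would also accept words that merely start with one of them, such as "locally"):

```python
import re

# Hypothetical names; the alternatives are copied from the question.
FILTER_WORDS = r"local|dozen|several|looted"
PROPERTY_KEY = r"gun store|establishments|supermarket"

# Capture a filter word only when it directly precedes one of the phrases,
# separated by whitespace and/or punctuation.
PATTERN = re.compile(
    rf"\b({FILTER_WORDS})[\W_]+(?:{PROPERTY_KEY})\b",
    flags=re.IGNORECASE,
)

def preceding_filter_word(source):
    """Return the filter word found right before a property phrase, or None."""
    m = PATTERN.search(source)
    return m.group(1) if m else None

text = ("As reported on 30 December 2019, in Maipu, Metropolitan region, "
        "a group of at least 10 rioters attempted to loot a local gun store.")
print(preceding_filter_word(text))                              # local
print(preceding_filter_word("They robbed a small gun store."))  # None
```

Because the search runs on the whole string, no split() is needed and the two-word phrase "gun store" is matched as a unit.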