Search code examples
regexpandaspython-re

How to exclude prefixed substrings from column values


My goal is to exclude the rows prefixed by No, for example I'm interested by the word commerce, then I have to keep all rows contains commerce and exclude no commerce. I'm looking for an generic solution

data = {
'description':[
    'Commerce foo', 
    'foo commercial',
    'foo no commercial',
    'foo commercial boo',
    'No commerce',
    'foo restaurant',
    'boo no restaurant']
}

df = pd.DataFrame(data)

    ______________________
    | description        |
__________________________
| 0 | Commerce foo       |
| 1 | foo commercial     |
| 2 | foo no commercial  |
| 3 | foo commercial boo |
| 4 | No commerce        |
| 5 | foo restaurant     |
| 6 | boo no restaurant  |
__________________________

I tried (not working)

search = ['restaurant', 'commerce', 'commercial']

df['description'].str.findall(f'̂̂̂^(?<=no\s)({"|".join(search)})', flags=re.IGNORECASE)

What I'm looking for

    ______________________
    | description        |
__________________________
| 0 | Commerce foo       |
| 1 | foo commercial     |
| 3 | foo commercial boo |
| 5 | foo restaurant     |
__________________________

Solution

  • The pattern currently starts with ^(?<=no\s) which asserts that from the start of the string there should be no followed by a whitespace char directly to the left. That will not match.

    If you want to filter the rows, you can use str.contains instead of findall.

    You can remove the anchor ^ and change the assertion to (?<!no\s) what will assert that there is not no followed by a whitespace char directly to the left and add word boundaries to the left and right to prevent partial word matches.

    For a match only, you can change the capture group to a non capturing one (?:

    If you print the pattern, it will look like

    \b(?<!no\s)(?:restaurant|commerce|commercial)\b
    

    See a regex demo

    data = {
        'description':[
            'Commerce foo',
            'foo commercial',
            'foo no commercial',
            'foo commercial boo',
            'No commerce',
            'foo restaurant',
            'boo no restaurant']
    }
    
    df = pd.DataFrame(data)
    search = ['restaurant', 'commerce', 'commercial']
    print(df[df['description'].str.contains(fr'\b(?<!no\s)(?:{"|".join(search)})\b', flags=re.IGNORECASE)])
    

    Output

              description
    0        Commerce foo
    1      foo commercial
    3  foo commercial boo
    5      foo restaurant