Tags: python, regex, python-2.7, regex-lookarounds, lookbehind

Using regular expression to find specific strings between parentheses (including parentheses)


I am trying to use a regular expression to find specific strings between parentheses in a string like the one below:

foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'

Specifically, I want to find only (peach W/O juice), (pear W/O water), and (pineapple W/O salt).

I tried lookahead and lookbehind, but was unable to obtain the correct results.

For example, when I use the following regex:

import re
regex = r'(?<=[\s\(])\([^\)].*\sW/O\s[^\)].*\)(?=[\)\s])'
re.findall(regex, foo)

I end up with the entire string:

['(peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt)']

EDIT:

I found the problem:

Instead of [^\)].*, I should use [^\)]*, which gives me the correct result:

regex = r'(?<=[\s\(])\([^\)]*\sW/O\s[^\)]*\)(?=[\)\s])'

re.findall(regex, foo)
['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']

Solution

  • I think the problem is that your .* operators are greedy: they consume as much as they can unless you put a ? after them (.*?). Also, since you want the parentheses in the result, you shouldn't need the lookahead/lookbehind operations; lookarounds leave whatever they match out of the result.
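
To make these two points concrete, here is a minimal sketch on a made-up string (the `text` value and the variable names are illustrative, not taken from the question). It compares a greedy `.*` with a non-greedy `.*?`, and shows that a lookbehind/lookahead pair checks for the parentheses without including them in the match.

```python
import re

# Hypothetical sample string, chosen only for illustration.
text = '(one) or (two)'

# Greedy: .* runs to the last ')' it can reach, so both groups merge.
greedy = re.findall(r'\(.*\)', text)

# Non-greedy: .*? stops at the first ')' that completes a match.
lazy = re.findall(r'\(.*?\)', text)

# Lookarounds assert that '(' and ')' are there but leave them out of the result.
inner = re.findall(r'(?<=\()[^()]*(?=\))', text)

print(greedy)  # ['(one) or (two)']
print(lazy)    # ['(one)', '(two)']
print(inner)   # ['one', 'two']
```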

    Instead of fully debugging your regex, I decided to just rewrite it:

    >>> import re
    >>> foo = '((peach W/O juice) OR apple OR (pear W/O water) OR kiwi OR (lychee AND sugar) OR (pineapple W/O salt))'
    >>> regex = r'\([a-zA-Z ]*?W/O.*?\)'
    >>> re.findall(regex, foo)
    ['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']
    

    Here's the breakdown:

    \( matches the opening parenthesis (escaped, because an unescaped ( starts a group)

    [a-zA-Z ] matches any single letter or a space (note the space after Z, before the closing bracket). I used this instead of . so that the match cannot run across another parenthesis. With . in its place, (lychee AND sugar) OR (pineapple W/O salt) would be returned as one match.

    *? the * lets the bracketed class repeat zero or more times, and the ? makes it non-greedy, so it takes only as many characters as the match needs

    W/O matches the literal "W/O" that you're looking for

    .*? matches any further characters (again non-greedy because of the ?), stopping at the first closing parenthesis that follows

    \) matches the closing parenthesis
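
The point about . versus a restricted character class can be checked directly. The sketch below runs three patterns over the `foo` string from the question: the rewritten pattern with `.` substituted for the class, the rewritten pattern as given, and a variant built on the negated class `[^()]`. The third pattern is an assumed alternative, not part of the rewrite above, and the variable names are only illustrative.

```python
import re

foo = ('((peach W/O juice) OR apple OR (pear W/O water) OR kiwi '
       'OR (lychee AND sugar) OR (pineapple W/O salt))')

# '.' can cross parentheses, so a match may start too early or span two groups.
with_dot = re.findall(r'\(.*?W/O.*?\)', foo)

# The rewritten pattern: only letters and spaces may come before W/O.
with_class = re.findall(r'\([a-zA-Z ]*?W/O.*?\)', foo)

# Assumed alternative: forbid any parenthesis inside the match.
negated = re.findall(r'\([^()]*W/O[^()]*\)', foo)

print(with_dot)
# ['((peach W/O juice)', '(pear W/O water)',
#  '(lychee AND sugar) OR (pineapple W/O salt)']
print(with_class)
# ['(peach W/O juice)', '(pear W/O water)', '(pineapple W/O salt)']
print(negated == with_class)  # True
```

Because `[^()]` excludes both kinds of parenthesis, the third pattern needs neither lookarounds nor non-greedy quantifiers for this input, and it would also accept digits or punctuation inside the parentheses.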