python regex word-boundary

Why does my regex with word boundary fail?


I'd like to match a number, positive or negative, possibly with a currency sign in front. But I don't want to match something like PCSK-9. My code is:

import re

test = 'AAA PCSK-9, $111 -3,33'
re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)

Output is: ['-9', '111', '3,33']. Could someone explain why -9 is matched? Thank you in advance.

Edit: I don't want any part of PCSK-9 to be matched; it is the name of a product rather than a number. So my desired output is:

['111', '3,33']

Solution

  • The word boundary \b matches between the K and the dash, because K is a word character and the dash is not. The leading -? then consumes the dash, the two parts after it, [$€£]?-?, are optional because of the question marks and match nothing, and \d+ matches one or more digits. This results in the match -9.

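    Instead of a word boundary, you could assert that there is no non-whitespace character \S directly before or after the match, using a negative lookbehind (?<!\S) and a negative lookahead (?!\S). The match must then start at the beginning of the string or after whitespace, and end at the end of the string or before whitespace.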

    (?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)

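As a quick check, here is a minimal sketch that runs both patterns against the string from the question. The variable names are only illustrative, and the third pattern is an assumed variation (the capturing group made non-capturing) that is not part of the answer above.

```python
import re

test = 'AAA PCSK-9, $111 -3,33'

# Original pattern: \b succeeds between "K" (a word character) and "-"
# (a non-word character), so the optional leading dash is consumed.
with_boundary = re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)

# Lookaround pattern: the match must be delimited by whitespace or by the
# start/end of the string, so nothing inside "PCSK-9," can match.
with_lookarounds = re.findall(r'(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)', test)

# Assumed variation: without a capturing group, findall returns the whole
# match, including the sign and currency symbol.
full_matches = re.findall(r'(?<!\S)-?[$€£]?\d+(?:[,.]\d+)?(?!\S)', test)

print(with_boundary)     # ['-9', '111', '3,33']
print(with_lookarounds)  # ['111', '3,33']
print(full_matches)      # ['$111', '-3,33']
```

Because the lookaround pattern contains one capturing group, re.findall returns only that group, which is why the sign and currency symbol are missing from its result.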