Tags: python, regex, python-re

Proper regex (re) pattern in Python


I'm trying to come up with a proper regex pattern for the strings that I have (and I am very bad at it). Each attempt ends up working only partly. I'll show the pattern I made further below, but first I want to specify what I need to extract from the text.

Data:

  • Company Fragile9 Closes €9M Series B Funding
  • Appplle21 Receives CAD$17.5K in Equity Financing
  • Cat Raises $10.8 Millions in Series A Funding
  • Sun Raises EUR35M in Funding at a $1 Billion Valuation
  • Japan1337 Announces JPY 1.78 Billion Funding Round

From that data I only need to extract the amount of money a company receives, including the currency symbol ($, € etc.) and the currency code if it's there, such as CAD for Canadian dollars.

So, as a result, I expect to get this:

  • €9M
  • CAD$17.5K
  • $10.8 Millions
  • EUR35M
  • JPY 1.78 Billion

The pattern that I use (throw rotten tomatoes at me):

import re

try:
    pattern = '(\bAU|\bUSD|\bUS|\bCHF)*\s*[\$\€\£\¥\₣\₹\?]\s*\d*\.?\d*\s*(K|M)*[(B|M)illion]*'
    raises = re.search(pattern, text, re.IGNORECASE)  # text: one row of the data mentioned above
    raises = raises.group().upper().strip()
    print(raises)
except AttributeError:  # re.search returned None, i.e. nothing matched
    raises = '???'
    print(raises)

Also, sometimes a pattern that works in an online Python regex editor does not work in the actual script.


Solution

  • Some issues in your regex:

    • The list of currency acronyms (AU, USD, US, CHF) is too limited. It will not match JPY, nor many other acronyms. Maybe allow any word of 2-3 capital letters.

    • Not a problem, but there is no need to escape the currency symbols with a backslash.

    • The \? in the currency list is not a currency symbol.

    • The regex always requires a currency symbol, even when a currency acronym is present, so EUR35M and JPY 1.78 Billion cannot match. Maybe you intended to make the currency symbol optional with \?, but then the ? should appear unescaped after the character class, and it should still be possible to have only the symbol without the acronym.

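    • The number part \d*\.?\d* makes every piece optional, so it also matches no digits at all or a lone dot. At least one digit should be required, and the decimal part should be optional as a whole.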

    • (K|M)* will allow KKKKKKK. You don't want a * here.

    • [(B|M)illion]* is a character class, so it will allow the letters B, M, i, l, o, n, a literal pipe and literal parentheses to occur in any order and any number of times. For example, it will match "in", "non" and "(BooM)".

    • The previous two mentioned patterns are put in sequence, while they should be mutually exclusive.

    • The regex does not provide for matching the final "s" in "millions".
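The two suffix problems listed above are easy to check in isolation. The following is only a small sketch: the names repeated_suffix, loose_class, samples, accepted and both are made up for this illustration, and re.IGNORECASE is used because the original script passes that flag.

```python
import re

# (K|M)* repeats without limit, so a run of K's is accepted.
repeated_suffix = re.fullmatch(r"(K|M)*", "KKKKKKK", re.IGNORECASE)

# [(B|M)illion]* is a character class: any mix of ( B | M ) i l o n,
# not the words "Billion" or "Million".
loose_class = re.compile(r"[(B|M)illion]*", re.IGNORECASE)
samples = ["in", "non", "(BooM)", "Million", "Trillion"]
accepted = {s: loose_class.fullmatch(s) is not None for s in samples}

# Put in sequence, both parts can match at the same time.
both = re.fullmatch(r"(K|M)*[(B|M)illion]*", "KMillion", re.IGNORECASE)

print(repeated_suffix is not None)
print(accepted)
print(both is not None)
```

Only "Trillion" is rejected here, and only because T and r happen not to be in the character class.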

    Here is a correction:

    (?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?
    


    In Python syntax:

    pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"
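As a quick check, here is a minimal sketch that runs this pattern over the five sample headlines from the question. The names AMOUNT_RE, headlines, extract_amount and amounts are made up for this illustration. The pattern is compiled without re.IGNORECASE on purpose: with that flag, [A-Z]{2,3} would also accept lowercase words.

```python
import re

# Corrected pattern, written as a raw string so that \b stays a word boundary.
AMOUNT_RE = re.compile(
    r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?"
    r"(?:\s*(?:K|[BM](?:illions?)?)\b)?"
)

headlines = [
    "Company Fragile9 Closes €9M Series B Funding",
    "Appplle21 Receives CAD$17.5K in Equity Financing",
    "Cat Raises $10.8 Millions in Series A Funding",
    "Sun Raises EUR35M in Funding at a $1 Billion Valuation",
    "Japan1337 Announces JPY 1.78 Billion Funding Round",
]


def extract_amount(text):
    """Return the first amount found in text, or None if there is none."""
    match = AMOUNT_RE.search(text)
    return match.group() if match else None


amounts = [extract_amount(h) for h in headlines]
for amount in amounts:
    print(amount)
# €9M
# CAD$17.5K
# $10.8 Millions
# EUR35M
# JPY 1.78 Billion
```

The r prefix matters: in a plain (non-raw) Python string literal, '\b' is a backspace character rather than a word boundary, which is one way a pattern can work in an online editor and then behave differently in a script. Note also that any word of 2-3 capital letters followed by a number, such as "AI 2021", will match as well, so the acronym part may need tightening for noisier input.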