I'm trying to come up with a proper regex pattern (and I am very bad at it) for the strings that I have. Each time I end up with something that only partly works. I'll show the pattern that I made below, but first I want to specify what I want to extract from the text.
Data:
From that data I only need to extract the amount of money a company receives (including $/€ etc., and a currency specification if it's there, like Canadian dollars (CAD)).
So, as a result, I expect to receive this:
The pattern that I use (throw rotten tomatoes at me):
try:
    pattern = '(\bAU|\bUSD|\bUS|\bCHF)*\s*[\$\€\£\¥\₣\₹\?]\s*\d*\.?\d*\s*(K|M)*[(B|M)illion]*'
    raises = re.search(pattern, text, re.IGNORECASE)  # text – a row of data mentioned above
    raises = raises.group().upper().strip()
    print(raises)
except:
    raises = '???'
    print(raises)
Also, sometimes a pattern that works in an online Python regex editor does not work in the actual script.
Some issues in your regex:
The list of currency acronyms (AU USD US CHF) is too limited. It will not match JPY, nor many other acronyms. Maybe allow any word of 2-3 capitals.
Not a problem, but there is no need to escape the currency symbols with a backslash.
The \? in the currency list is not a currency symbol.
The regex always requires a currency symbol, even when a currency acronym is present. Maybe you intended to make the currency symbol optional with \?, but then the ? should appear unescaped after the character class, and it should still be possible to have only the symbol without the acronym.
The number part \d*\.?\d* makes every piece optional, so it can match without a single digit. Require at least one digit and make only the decimal part optional.
(K|M)* will allow KKKKKKK. You don't want a * here.
[(B|M)illion]* will allow the letters BMilon, a literal pipe and literal parentheses to occur in any order and any number of times. For example, it will match "in", "non" and "(BooM)".
The two patterns just mentioned are put in sequence, while they should be mutually exclusive.
The regex does not provide for matching the final "s" in "millions".
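To make the character-class and repetition points above concrete, here is a small sketch that can be run as-is. The variable names and the test strings are only illustrative, and the "strict" suffix shown is one possible replacement, not the only one.

```python
import re

# Inside [...] the parentheses and the pipe are literal characters, so this
# class accepts any run made of the characters ( ) | B M i l o n.
loose_class = re.compile(r"[(B|M)illion]*", re.IGNORECASE)
class_hits = [s for s in ["in", "non", "(BooM)"] if loose_class.fullmatch(s)]
print(class_hits)

# (K|M)* may repeat without limit, so a run of K's is accepted.
print(bool(re.fullmatch(r"(K|M)*", "KKKKKKK")))

# One possible replacement: a single group of mutually exclusive suffixes,
# with an optional trailing "s" for "Millions"/"Billions".
strict_suffix = re.compile(r"(?:K|[BM](?:illions?)?)")
candidates = ["K", "M", "B", "Million", "Billions", "KKKKKKK", "non", "KM"]
suffix_hits = [s for s in candidates if strict_suffix.fullmatch(s)]
print(suffix_hits)
```

All three strings pass the loose class, while the strict suffix rejects "KKKKKKK", "non" and "KM".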
Here is a correction:
(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?
On regex101
In Python syntax:
pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"
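A minimal usage sketch follows. The helper name extract_amount and the sample sentences are made up for illustration, since the actual data rows are not shown here. Note the r prefix: in a normal string literal Python turns \b into a backspace character before the regex engine ever sees it, which may be why a pattern behaves differently in a script than in an online tester. Also, this pattern is meant to be used without re.IGNORECASE, otherwise [A-Z]{2,3} would accept any two- or three-letter word.

```python
import re

# In a normal (non-raw) string literal "\b" is a backspace character,
# not the two characters the regex engine needs for a word boundary.
assert "\b" == "\x08"
assert r"\b" == "\\b"

# The corrected pattern, split over two raw strings for readability.
PATTERN = re.compile(
    r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?"
    r"(?:\s*(?:K|[BM](?:illions?)?)\b)?"
)

def extract_amount(text, default="???"):
    """Return the first money amount found in text, or default if none."""
    match = PATTERN.search(text)
    return match.group().strip() if match else default

# Hypothetical rows, only to show the call.
samples = [
    "Acme raises $12.5 Million in Series A",
    "Startup secures CAD 3M seed round",
    "Company closes €500K round",
    "No funding details disclosed",
]
for row in samples:
    print(extract_amount(row))
```

Checking the match object for None replaces the bare try/except, so unrelated errors are no longer silently swallowed.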