Tags: python, regex, python-3.5, currency

Currency regular expression for GBP/USD/EUR


Need some help with creating a regular expression that satisfies the following rules. Any suggestions would be greatly appreciated.

(1a.) Optionally start with: £, $, €

1b. Currency value must start with one to three digits

2a. Before a comma: There must be one to three digits

2b. After a comma: There must be three digits

2c. After a decimal point: There can only be digits

3a. Currency value must end with one or many digits

[3b.] Value may be followed by: tn/Tn/Trillion/trillion, bn/Bn/Billion/billion, m/Million/million.

(3c.) Optionally end with: p/P/Pence/pence, c/C/Cents/cents, €/Euro(s)/euro(s), Dollars/dollars, Pounds/pounds.

Rules 1a and 3c are mutually exclusive, but exactly one of them must be used:

$1 dollar ✘
1 ✘
$1 ✓
1 dollar ✓

Rule 3b may be used with Rule 1a or 3c, but does not need to be used:

$1 trillion ✓
1 trillion dollars ✓
$1 ✓

Rule 2a/2b may be used zero or many times:

$1 ✓
$1,000,000,000,000 ✓

Rule 2c may only be used once or zero times:

$1 ✓
$1.000 ✓

Expected Result:

$1 dollar ✘
1 ✘
$1,00000.000,000 ✘
1,000.00 ✘

$1 ✓
1 dollar ✓
$1 trillion ✓
1 trillion dollars ✓
$1,000,000,000,000 ✓
$1.000 ✓
$1,000,000,000,000.000000 ✓

Here's what I have so far:

[£€$]?[0-9]+[,.]?[0-9][pcm][ euros| euro]*

Solution

  • The following regex doesn't rely on each value being on its own line; it also picks values out of running text.

    It also assumes that the units "cents", "dollars" and "pounds" can be singular.

    In addition, it allows any amount of whitespace, including none, between the number and the value word or unit that follows it.

    Explanation:

    The following is the basic structure of the regex with sub-expressions represented by values surrounded by two @s:

    (@Prefix@)?(?=(@Value@)(\s*@Postfix@)?)(?(1)\2(?!\3)|(?<!@Prefix@)\2\3)
    |________|    |_______||____________|  |______________________________|
        |             |           |                       |
     Group 1       Group 2    Group 3          Prefix-Postfix Selector
    

    Group 1 optionally matches the prefix.

    Group 2 and Group 3 are captured inside a look-ahead so that when Prefix-Postfix Selector is executed, only Group 1 is part of the overall match.

    Prefix-Postfix Selector is a conditional statement that does the following:

    • If Group 1 (Prefix) is matched, then add Group 2 (Value) to the overall match iff there is no Group 3 (Postfix) following it.

    • If no Prefix is matched, then set the overall match to Value followed by Postfix iff there is no Prefix preceding Value.
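
To see the selector in isolation, here is a minimal sketch in Python, whose re module supports the (?(1)yes|no) conditional. The pattern is a deliberately simplified stand-in for the real sub-expressions: only "$" as the prefix, a plain integer as the value and " dollars" as the postfix. The names SIMPLE and matched_text are assumptions made for this illustration.

```python
import re

# Simplified stand-in for the full structure:
#   group 1 = optional prefix, group 2 = value, group 3 = optional postfix.
# Groups 2 and 3 are captured inside the look-ahead, so nothing after the
# prefix is consumed until the conditional replays the back-references.
SIMPLE = re.compile(r"(\$)?(?=(\d+)( dollars)?)(?(1)\2(?!\3)|(?<!\$)\2\3)")


def matched_text(text):
    """Return the text matched anywhere in `text`, or None if nothing matches."""
    m = SIMPLE.search(text)
    return m.group(0) if m else None


print(matched_text("$5"))          # prefix only        -> $5
print(matched_text("5 dollars"))   # postfix only       -> 5 dollars
print(matched_text("$5 dollars"))  # prefix and postfix -> None
print(matched_text("5"))           # neither            -> None
```

A back-reference to a group that did not take part in the match fails in Python, which is what makes \2\3 reject a bare number and (?!\3) succeed when no postfix was captured.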

    The sub-expressions are fairly self-explanatory. The \b anchors make sure that entire words are matched. Similarly, the negative look-ahead (?![\d.,]) after the number makes sure that no digits, commas or decimal points are left over.

    @Prefix@:

    [£€$]
    

    @Value@:

    \d{1,3}(?:,\d{3})*(?:\.\d+)?(?![\d.,])(?:\s*(?:[tTbB]n|m|(?:[tT]r|[bBmM])illion)\b)?
    |_____||_________||________||        ||                                            |
    |__________________________||________||____________________________________________|
                 |                   |                          |
       Number, e.g. 12,345.6         |    [[Whitespace] + Value Word, e.g. Tn or Billion]
                                     |
               Makes sure "1000" is not matched, for example
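
The effect of the (?![\d.,]) guard is easiest to see by matching the number part with and without it. The following is a small sketch; the names NUMBER, GUARD and leading_number are assumptions for the example, and the value word is left out to keep the focus on the number.

```python
import re

# Number part of @Value@, split so the guard can be switched on and off.
NUMBER = r"\d{1,3}(?:,\d{3})*(?:\.\d+)?"
GUARD = r"(?![\d.,])"


def leading_number(text, guarded=True):
    """Return the number matched at the start of `text`, or None."""
    pattern = NUMBER + GUARD if guarded else NUMBER
    m = re.match(pattern, text)
    return m.group(0) if m else None


for text in ["12,345.6 apples", "1000", "1,00000"]:
    # Columns: input, match without the guard, match with the guard.
    print(repr(text), leading_number(text, guarded=False), leading_number(text))
```

Without the guard, "1000" yields the partial match "100" and "1,00000" yields "1,000"; with the guard both are rejected, while "12,345.6" is matched either way.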
    

    @Postfix@:

    (?:[pP](?:ence)?|[cC](?:ents?)?|€|[eE]uros?|[dD]ollars?|[pP]ounds?)\b
    

    Solution:

    Replacing the placeholders with the sub-expressions leads to this full regex:

    ([£€$])?(?=(\d{1,3}(?:,\d{3})*(?:\.\d+)?(?![\d.,])(?:\s*(?:[tTbB]n|m|(?:[tT]r|[bBmM])illion)\b)?)(\s*(?:[pP](?:ence)?|[cC](?:ents?)?|€|[eE]uros?|[dD]ollars?|[pP]ounds?)\b)?)(?(1)\2(?!\3)|(?<![£€$])\2\3)
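
As a sanity check, the substitution can be done in Python and the result run against the expected results from the question. This is a sketch rather than a definitive implementation: the names PREFIX, VALUE, POSTFIX, STRUCTURE, build_pattern and EXPECTED are assumptions for the example, and fullmatch is used because each test string is a single value.

```python
import re

# Sub-expressions from the explanation above.
PREFIX = r"[£€$]"
VALUE = (
    r"\d{1,3}(?:,\d{3})*(?:\.\d+)?(?![\d.,])"
    r"(?:\s*(?:[tTbB]n|m|(?:[tT]r|[bBmM])illion)\b)?"
)
# The whitespace before the postfix is supplied once, by group 3 in STRUCTURE.
POSTFIX = r"(?:[pP](?:ence)?|[cC](?:ents?)?|€|[eE]uros?|[dD]ollars?|[pP]ounds?)\b"

STRUCTURE = r"(@Prefix@)?(?=(@Value@)(\s*@Postfix@)?)(?(1)\2(?!\3)|(?<!@Prefix@)\2\3)"


def build_pattern():
    """Replace the @...@ placeholders with their sub-expressions."""
    return (
        STRUCTURE.replace("@Prefix@", PREFIX)
        .replace("@Value@", VALUE)
        .replace("@Postfix@", POSTFIX)
    )


CURRENCY_RE = re.compile(build_pattern())

# Expected results listed in the question (True = should match).
EXPECTED = {
    "$1 dollar": False,
    "1": False,
    "$1,00000.000,000": False,
    "1,000.00": False,
    "$1": True,
    "1 dollar": True,
    "$1 trillion": True,
    "1 trillion dollars": True,
    "$1,000,000,000,000": True,
    "$1.000": True,
    "$1,000,000,000,000.000000": True,
}

for text, should_match in EXPECTED.items():
    got = CURRENCY_RE.fullmatch(text) is not None
    print(f"{text!r:32} expected={should_match} got={got}")
```

To pull values out of running text instead, CURRENCY_RE.finditer(text) can be used in place of fullmatch.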
    



    Caveats:

    • A value like "$1" that is directly followed by a comma or full stop in a sentence will not be matched. (For example, only "$2" is matched in the sentence "This sentence is worth $1, $2 or $3.")

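    • A value consisting of a number with commas and/or a decimal point and a value word is allowed, e.g. "1,000,000 million".

    • In the postfix, € is followed by \b, which needs a word character on one side. Because € is not a word character, a trailing "€" as in "1€" is not matched as written, whereas "€1" and "1 euro" are.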
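
These edge cases can be reproduced with a short Python check. It is a sketch using the full regex from this answer; the names CURRENCY_RE and found are assumptions for the example. It also covers a trailing "€", whose behaviour follows from \b needing a word character on one side.

```python
import re

# Full regex from the answer, split across lines for readability only.
CURRENCY_RE = re.compile(
    r"([£€$])?"
    r"(?=(\d{1,3}(?:,\d{3})*(?:\.\d+)?(?![\d.,])"
    r"(?:\s*(?:[tTbB]n|m|(?:[tT]r|[bBmM])illion)\b)?)"
    r"(\s*(?:[pP](?:ence)?|[cC](?:ents?)?|€|[eE]uros?|[dD]ollars?|[pP]ounds?)\b)?)"
    r"(?(1)\2(?!\3)|(?<![£€$])\2\3)"
)


def found(text):
    """Return every value the regex picks out of `text`."""
    return [m.group(0) for m in CURRENCY_RE.finditer(text)]


# "$1" and "$3" are followed by "," and ".", so only "$2" survives.
print(found("This sentence is worth $1, $2 or $3."))

# A comma-separated number may still be followed by a value word.
print(found("$1,000,000 million"))

# "€" is not a word character, so the \b after it fails at the end of the string.
print(found("1€"), found("€1"), found("1 euro"))
```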