python regex

Regular expression to extract monetary values from an invoice text


I have a regular expression (Python): \b[-+]?(?:\d{1,3}\.)(?:\d{3}\.)*(?:\d*) It matches numeric values in strings like these:

Amount: 12.234.55222 EUR
Some Text 123.222.22 maybe more text
1.245.455.2
22.34565 Could be at the beginning
It could be at the end of a string 21.1
221. It could be a number like this too (I've seen a lot of different formats on US invoices)

That is what I want, but it also matches the first two parts of a date such as 08.05.2023 (the match is 08.05).

I know this is happening because of the first and the last group, but I don't know how to prevent that. I only want to match values that stand by themselves.
Can somebody point me in the right direction?
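The behaviour described above can be reproduced with a short script. This is only a sketch: the pattern is copied verbatim from the question, while the name PATTERN and the choice of sample strings are illustrative assumptions.

```python
import re

# Pattern copied verbatim from the question; PATTERN is just an illustrative name.
PATTERN = re.compile(r"\b[-+]?(?:\d{1,3}\.)(?:\d{3}\.)*(?:\d*)")

samples = [
    "Amount: 12.234.55222 EUR",
    "It could be at the end of a string 21.1",
    "08.05.2023",  # a date, which should not be matched at all
]

for sample in samples:
    # findall returns the full matches because the pattern has no capturing groups
    print(repr(sample), "->", PATTERN.findall(sample))
```

For the date, the first group consumes 08., the repeated group matches zero times because 05 has only two digits, and the trailing \d* then takes 05, so the partial match 08.05 is reported.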

Edit: I forgot to mention that I've tried it with a negative lookahead, but that didn't work:

\b([-+]?(?:\d{1,3}\.)(?:\d{3}\.)*(?:\d*))(?!(?:\.d{4}))

Maybe I'm doing the lookahead wrong?
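A note on why the attempt above probably cannot work, followed by a hedged sketch of one possible alternative. In the attempt, d{4} lacks a backslash, so it looks for four literal d characters. Even with \d{4}, the regex engine simply backtracks to a shorter match that is no longer followed by a dot and four digits. One option, assuming amounts are always delimited by whitespace or the string boundaries, is to put lookarounds on both sides. The names LOOKAHEAD_ATTEMPT, STANDALONE and find_amounts are hypothetical and not from the original post.

```python
import re

# The attempt from the question with the missing backslash restored (\.\d{4}).
LOOKAHEAD_ATTEMPT = re.compile(
    r"\b([-+]?(?:\d{1,3}\.)(?:\d{3}\.)*(?:\d*))(?!(?:\.\d{4}))"
)

# Assumed alternative: no non-whitespace character may touch the value on
# either side, so a piece of a longer dotted token (like a date) is rejected.
STANDALONE = re.compile(r"(?<!\S)[-+]?\d{1,3}\.(?:\d{3}\.)*\d*(?!\S)")


def find_amounts(text: str) -> list[str]:
    """Return all whitespace-delimited values that fit the amount pattern."""
    return STANDALONE.findall(text)


# The lookahead version still matches, because \d* gives back one digit.
print(LOOKAHEAD_ATTEMPT.findall("08.05.2023"))
# The whitespace-delimited version rejects the date and keeps the amounts.
print(find_amounts("08.05.2023"))
print(find_amounts("Amount: 12.234.55222 EUR"))
print(find_amounts("221. It could be a number like this too"))
```

A limitation of this sketch: a value directly followed by punctuation, such as "21.1," at the end of a clause, is not matched, because the comma counts as a non-whitespace neighbour.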


Solution

  • Building on @user24714692's idea, I've done it this way:

    import re


    def float_check(text: str) -> bool:
        """Return True if text can be parsed as a float."""
        try:
            float(text)
            return True
        except ValueError:
            return False


    def match_amount(amount: float, text: str) -> bool:
        # a whole token must look like 1.234.567.89; the ^...$ anchors
        # reject tokens such as the date 08.05.2023
        pattern = r"^([-+]?(?:\d{1,3}\.)(?:\d{3}\.)*(?:\d*))$"

        # replace all commas, because US invoices write numbers like 291,234.23
        text = text.replace(",", ".")

        # keep only the whitespace-separated tokens that match the pattern
        numbers = [
            pot_number for pot_number in text.split() if re.findall(pattern, pot_number)
        ]

        for number in numbers:
            # now remove all dots but the last one
            number = re.sub(r"\.(?=.*\.)", "", number)
            # convert to float and check whether it equals the given amount
            if float_check(number) and float(number) == amount:
                return True
        return False


    print(match_amount(23.45, "some text with numbers here"))
    

    It works pretty well now, thank you very much.
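The normalization step in the solution (commas to dots, then dropping every dot except the last) can be looked at in isolation. The following is a minimal sketch; normalize_amount is a hypothetical helper name, and it assumes the last separator in a token is always the decimal point.

```python
import re


def normalize_amount(token: str) -> float:
    """Hypothetical helper: parse '291,234.23' or '1.245.455.2' as a float."""
    # Treat commas like dots, as the solution does for US-style numbers.
    token = token.replace(",", ".")
    # Remove every dot that is followed by another dot later in the token,
    # which leaves only the last dot as the decimal separator.
    token = re.sub(r"\.(?=.*\.)", "", token)
    return float(token)


print(normalize_amount("291,234.23"))
print(normalize_amount("1.245.455.2"))
print(normalize_amount("221."))
```

Under that assumption, a token such as 1.234 is read as one point two three four rather than one thousand two hundred thirty-four, so amounts without a decimal part need separate handling.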