Tags: python, regex, python-re

Regular expression to extract integers


I need help extracting numbers from a column that stores text. The text can also contain prices, which I don't want to extract. For example, given the following text:

text = ("I have the following products 4526 and 4. The first one I paid $40 while the second one 30€. \n"
        "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757")

My expected result would be

[4526, 4]

Right now I am using the following regular expression:

'(?<![\d.])[0-9]+(?![\d.])'

which discards the 3.99 but still matches the prices and the numbers in the link. Any suggestions on how to update the regex?
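For reference, a quick check of the pattern above on the sample text might look like the sketch below. The variable names are only illustrative; the pattern and the text are the ones from the question.

```python
import re

# The pattern from the question: digits not adjacent to another digit or a dot.
pattern = r"(?<![\d.])[0-9]+(?![\d.])"
text = (
    "I have the following products 4526 and 4. The first one I paid $40 "
    "while the second one 30€. \n"
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
)
matches = re.findall(pattern, text)
print(matches)  # ['4526', '40', '30', '7574', '5757']
```

Besides matching the prices and the numbers in the link, this pattern also drops the standalone 4, because the lookahead rejects any number followed by a dot, including the sentence-ending one.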


Solution

  • Use

    (?<!\S)[0-9]+(?!\.\d|[^\s!?.])
    

    EXPLANATION

    --------------------------------------------------------------------------------
      (?<!                     look behind to see if there is not:
    --------------------------------------------------------------------------------
        \S                       non-whitespace (all but \n, \r, \t, \f,
                                 and " ")
    --------------------------------------------------------------------------------
      )                        end of look-behind
    --------------------------------------------------------------------------------
      [0-9]+                   any character of: '0' to '9' (1 or more
                               times (matching the most amount possible))
    --------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
    --------------------------------------------------------------------------------
        \.                       '.'
    --------------------------------------------------------------------------------
        \d                       digits (0-9)
    --------------------------------------------------------------------------------
       |                        OR
    --------------------------------------------------------------------------------
        [^\s!?.]                 any character except: whitespace (\n,
                                 \r, \t, \f, and " "), '!', '?', '.'
    --------------------------------------------------------------------------------
      )                        end of look-ahead
    

    Python code:

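    import re

    # Match digit runs that start after whitespace (or at the start of the string)
    # and are not followed by a decimal part or by any character other than
    # whitespace, '!', '?' or '.'.
    regex = r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])"
    test_str = "I have the following products 4526 and 4. The first one I paid $40 while the second one 30€. \nHere the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
    matches = re.findall(regex, test_str)  # returns the matches as strings
    print(matches)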
    

    Results: ['4526', '4']
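Since the expected result in the question is a list of integers and the text lives in a column, a minimal sketch of applying the same pattern row by row could look like this. The helper name `extract_integers` and the sample `rows` list are assumptions made for illustration (the third row is invented); a plain list stands in for the column.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused per row.
INT_PATTERN = re.compile(r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])")


def extract_integers(text):
    """Return the standalone integers found in text as a list of int."""
    return [int(match) for match in INT_PATTERN.findall(text)]


# Hypothetical rows standing in for the text column.
rows = [
    "I have the following products 4526 and 4. The first one I paid $40 while the second one 30€.",
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757",
    "Order 12 arrived! Items 7, 8 and 9.",
]
results = [extract_integers(row) for row in rows]
print(results)  # [[4526, 4], [], [12, 8, 9]]
```

Note that the 7 in the last row is skipped: it is directly followed by a comma, and the lookahead only tolerates whitespace, '!', '?' or '.' after a number. Whether that is desirable depends on the data.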