I need help extracting numbers from a column that stores text. The text can also contain prices, which I don't want to extract. For example, given the following text:
text = "I have the following products 4526 and 4. The first one I paid $40 while the second one 30€.
Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
My expected result would be
[4526, 4]
Right now I am using the following regular expression:
'(?<![\d.])[0-9]+(?![\d.])'
which discards the 3.99 but still matches the prices and the numbers in the link. Any suggestions on how to update the regex?
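For reference, a minimal reproduction of the current behaviour. The variable names `text`, `current_pattern` and `current_matches` are only illustrative:

```python
import re

# The sample text from above, with the line break written as \n.
text = (
    "I have the following products 4526 and 4. The first one I paid $40 "
    "while the second one 30€.\n"
    "Here the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
)

# The pattern I am currently using.
current_pattern = r"(?<![\d.])[0-9]+(?![\d.])"
current_matches = re.findall(current_pattern, text)
print(current_matches)
```

On this input it prints ['4526', '40', '30', '7574', '5757']: the prices and the link numbers get through, and the 4 that is directly followed by a full stop is dropped.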
Use
(?<!\S)[0-9]+(?!\.\d|[^\s!?.])
See proof.
EXPLANATION
--------------------------------------------------------------------------------
  (?<!          look behind to see if there is not:
--------------------------------------------------------------------------------
    \S            non-whitespace (all but \n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
  )             end of look-behind
--------------------------------------------------------------------------------
  [0-9]+        any character of: '0' to '9' (1 or more times (matching
                the most amount possible))
--------------------------------------------------------------------------------
  (?!           look ahead to see if there is not:
--------------------------------------------------------------------------------
    \.            '.'
--------------------------------------------------------------------------------
    \d            digits (0-9)
--------------------------------------------------------------------------------
   |             OR
--------------------------------------------------------------------------------
    [^\s!?.]      any character except: whitespace (\n, \r, \t, \f,
                  and " "), '!', '?', '.'
--------------------------------------------------------------------------------
  )             end of look-ahead
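To see which part of the pattern rejects which token, here is a small sketch that runs it against short snippets, each isolating one case from the question. The snippets and the `cases` / `results` names are made up for illustration:

```python
import re

pattern = re.compile(r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])")

# Hypothetical snippets mapped to the matches we expect from each one.
cases = {
    "products 4526 and": ["4526"],  # whitespace on both sides
    "and 4. The": ["4"],            # trailing '.' is not followed by a digit
    "paid $40 while": [],           # '$' before the digits: look-behind fails
    "one 30€.": [],                 # '€' after the digits: look-ahead fails
    "of 3.99:": [],                 # '.9' follows '3', and '.' precedes '99'
    "coom/7574@5757": [],           # '/' and '@' precede the digits
}

results = {snippet: pattern.findall(snippet) for snippet in cases}
for snippet, found in results.items():
    print(f"{snippet!r:22} -> {found}")
```

Note that the look-behind also rejects a match that would start in the middle of a number, which is why `40` does not yield `0` and `30€` does not yield `3` after backtracking.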
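import re
# The digits must start after whitespace (or at the start of the string) and must
# not be followed by a decimal part or by anything other than whitespace, '!', '?' or '.'.
regex = r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])"
test_str = "I have the following products 4526 and 4. The first one I paid $40 while the second one 30€. \nHere the link for the discount of 3.99: https://www.xysyffd.coom/7574@5757"
# findall returns the matched substrings as strings, not ints.
matches = re.findall(regex, test_str)
print(matches)
Results: ['4526', '4']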
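Since the expected result in the question is a list of integers and the text lives in a column, a possible follow-up is to wrap the pattern in a helper and convert the matches. This is only a sketch: `extract_numbers`, `rows` and the second sample row are hypothetical, and a plain list stands in for the column.

```python
import re

NUMBER_RE = re.compile(r"(?<!\S)[0-9]+(?!\.\d|[^\s!?.])")


def extract_numbers(text):
    """Return the standalone integers found in text as a list of ints."""
    return [int(match) for match in NUMBER_RE.findall(text)]


# Hypothetical column values; the first row is the text from the question.
rows = [
    "I have the following products 4526 and 4. The first one I paid $40 "
    "while the second one 30€. \nHere the link for the discount of 3.99: "
    "https://www.xysyffd.coom/7574@5757",
    "No product numbers here, only a price of $12.",
]

extracted = [extract_numbers(row) for row in rows]
print(extracted)
```

This prints [[4526, 4], []]. If the column is a pandas Series (an assumption, the question does not say so), `df["text"].apply(extract_numbers)` should produce the same per-row lists.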