
Python regular expression for phone numbers


I am very new to regular expressions and am looking for help parsing phone numbers out of HTML text.

On the source site, the HTML tags are badly malformed and offer no unique selectors that I can use. Below is the list of formats I am looking to parse.

raw = """+49 39291 55-217
02102 7007064
0152 01680970
+49 39291 55-216
02102 3802 22
0800 333004 451-100
+49 221 9937 26950
02151-47974510
+49(0)6105 937 -539
0211/409 2268
+49(0)6105 937 -539
+49211/584-623
0211 58422 2012
+49 (9131) 7-35335
+49 521 9488 2470
+ 49-40-70 70 84 - 0
0211 17 95 99 04
02151-47974327
+49 203 28900 1121
0211 9449-2555
+49 (5 41) 9 98 -2268"""

I tried this pattern but could not get any further with it:

import re

phones = re.findall(r'.*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?', raw)

phones
['102 7007064', '152 0168097', '151-4797451', '937 -539\n0211', '937 -539\n+4921', '584-623\n0211', '151-4797432']

Any advice or help is highly appreciated. Thank you.


Solution

  • I suggest using this pattern:

    (?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)? *(?:[)-] *)?\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?
    

    See the regex demo. Note that it is based on your comment saying the phone numbers start with +49 or a 0, and on the list of examples you provided. Consider it a work in progress, since you have not provided more specific rules for phone number extraction.
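As a minimal sketch of how the pattern above might be applied in Python: the name `PHONE_RE` is only illustrative, and `raw` here holds just a handful of the sample numbers from the question. Because the pattern contains no capturing groups, `findall` returns the full matched substrings.

```python
import re

# A few of the sample numbers from the question, one per line.
raw = """+49 39291 55-217
02102 7007064
+49(0)6105 937 -539
0211/409 2268
+ 49-40-70 70 84 - 0
+49 (5 41) 9 98 -2268"""

# The suggested pattern, split over two string literals for readability.
# PHONE_RE is a hypothetical name chosen for this sketch.
PHONE_RE = re.compile(
    r'(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)?'
    r' *(?:[)-] *)?\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?'
)

# No capturing groups, so findall yields whole matches as strings.
phones = PHONE_RE.findall(raw)
for phone in phones:
    print(phone)
```

The pattern only allows spaces, digits and a few separators between its parts, never a newline, so a match cannot spill from one line into the next the way the original attempt did. On these samples each line should come back as a single match.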

    Pattern details

    • (?:\B\+ ?49|\b0) - either a +, an optional space and 49, or a 0; neither alternative may be preceded by a word char
    • (?: *[(-]? *\d(?:[ \d]*\d)?)? - an optional substring matching 0+ spaces, then an optional ( or -, 0+ spaces, a digit, and then an optional sequence of digits/spaces ending with a digit
    •  *(?:[)-] *)? - 0+ spaces (this part starts with a literal space) and then an optional ) or - followed by 0+ spaces
    • \d+ - 1+ digits
    •  * - 0+ spaces (a literal space followed by *)
    • (?:[/)-] *)? - an optional /, ) or - followed by 0+ spaces
    • \d+ - 1+ digits
    •  *(?:[/)-] *)? - 0+ spaces (this part starts with a literal space) and then an optional /, ) or - followed by 0+ spaces
    • \d+ - 1+ digits
    • (?: *- *\d+)? - an optional sequence: 0+ spaces, -, 0+ spaces, 1+ digits.
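The breakdown above can also be kept next to the pattern itself by using `re.VERBOSE`. This is only a sketch: the names `COMPACT_RE`, `VERBOSE_RE` and `samples` are made up for illustration. In verbose mode unescaped whitespace in the pattern is ignored, so every literal space is written as `[ ]` here (a space inside a character class such as `[ \d]` is still kept).

```python
import re

# The pattern exactly as given in the answer.
COMPACT_RE = re.compile(
    r'(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)?'
    r' *(?:[)-] *)?\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?'
)

# Same pattern in verbose mode, one piece of the breakdown per line.
VERBOSE_RE = re.compile(r"""
    (?:\B\+[ ]?49|\b0)                  # +49 with an optional space after +, or a 0
    (?:[ ]*[(-]?[ ]*\d(?:[ \d]*\d)?)?   # optional: spaces, ( or -, spaces, digits/spaces ending in a digit
    [ ]*(?:[)-][ ]*)?                   # spaces, then an optional ) or - and spaces
    \d+                                 # 1+ digits
    [ ]*                                # 0+ spaces
    (?:[/)-][ ]*)?                      # an optional /, ) or - and spaces
    \d+                                 # 1+ digits
    [ ]*(?:[/)-][ ]*)?                  # spaces, then an optional /, ) or - and spaces
    \d+                                 # 1+ digits
    (?:[ ]*-[ ]*\d+)?                   # optional: spaces, -, spaces, 1+ digits
""", re.VERBOSE)

# Three of the sample numbers from the question.
samples = ["+49 39291 55-217", "0211/409 2268", "+ 49-40-70 70 84 - 0"]

for s in samples:
    # Both spellings of the pattern should agree on every sample.
    print(s, VERBOSE_RE.findall(s) == COMPACT_RE.findall(s))
```

Whether the verbose form is worth the extra lines is a matter of taste; it mainly helps when the rules are expected to change, since each piece can be edited and commented on its own.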