I am very new to regular expressions and am seeking help parsing phone numbers out of HTML text.
On the source site, the HTML tags are badly malformed and there are no unique selectors that I can use. Below is the list of formats I need to parse.
raw = """+49 39291 55-217
02102 7007064
0152 01680970
+49 39291 55-216
02102 3802 22
0800 333004 451-100
+49 221 9937 26950
02151-47974510
+49(0)6105 937 -539
0211/409 2268
+49(0)6105 937 -539
+49211/584-623
0211 58422 2012
+49 (9131) 7-35335
+49 521 9488 2470
+ 49-40-70 70 84 - 0
0211 17 95 99 04
02151-47974327
+49 203 28900 1121
0211 9449-2555
+49 (5 41) 9 98 -2268"""
I tried this pattern, but could not get any further with it:
import re
phones = re.findall(re.compile(r'.*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?'), raw)
phones
['102 7007064', '152 0168097', '151-4797451', '937 -539\n0211', '937 -539\n+4921', '584-623\n0211', '151-4797432']
Any advice or help is highly appreciated. Thank you.
I suggest using this pattern:
(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)? *(?:[)-] *)?\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?
See the regex demo. Note it is written based on your comment saying the phone numbers start with +49 or a 0, and on the list of examples you provided. It may be considered "work in progress" since you have not provided more specific rules for phone number extraction.
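As a rough illustration, the pattern can be passed to re.findall. This is a minimal sketch, assuming the text has already been extracted from the HTML into a string; the names PHONE_PATTERN and sample are made up for the example, and sample holds only a few of the numbers from the question.

```python
import re

# Pattern copied verbatim from the answer. All groups are non-capturing,
# so re.findall returns the full matched strings.
PHONE_PATTERN = (
    r'(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)? *(?:[)-] *)?'
    r'\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?'
)

# A handful of the sample numbers from the question (illustrative subset).
sample = """+49 39291 55-217
02102 7007064
+49(0)6105 937 -539
0211/409 2268
+ 49-40-70 70 84 - 0
+49 (5 41) 9 98 -2268"""

phones = re.findall(PHONE_PATTERN, sample)
for phone in phones:
    print(phone)
```

The pattern only allows literal spaces between digit groups, not newlines, so a match should not run from one line into the next the way the original attempt did.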
Pattern details
(?:\B\+ ?49|\b0) - a +, optional space, 49 or a 0; both substrings cannot be preceded with a word char
(?: *[(-]? *\d(?:[ \d]*\d)?)? - an optional substring matching 0+ spaces, then an optional ( or -, 0+ spaces, a digit and then an optional sequence of digits/spaces followed with a digit
 *(?:[)-] *)? - 0+ spaces and then an optional sequence of ) or - followed with 0+ spaces
\d+ - 1+ digits
 * - 0+ spaces
(?:[/)-] *)? - an optional sequence of /, ) or - followed with 0+ spaces
\d+ - 1+ digits
 *(?:[/)-] *)? - 0+ spaces and then an optional sequence of /, ) or - followed with 0+ spaces
\d+ - 1+ digits
(?: *- *\d+)? - an optional sequence: 0+ spaces, -, 0+ spaces, 1+ digits.
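The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of an equivalent spelling, not a different pattern: in verbose mode unescaped whitespace is ignored, so every literal space is written as [ ] here. The name PHONE_RE is an assumption made for the example.

```python
import re

# Verbose rewrite of the pattern above; [ ] stands for a literal space
# because re.VERBOSE ignores bare whitespace outside character classes.
PHONE_RE = re.compile(r"""
    (?:\B\+[ ]?49|\b0)                 # +, optional space, 49, or a single 0
    (?:[ ]*[(-]?[ ]*\d(?:[ \d]*\d)?)?  # optional: spaces, ( or -, digits/spaces ending in a digit
    [ ]*(?:[)-][ ]*)?                  # optional ) or - with surrounding spaces
    \d+                                # 1+ digits
    [ ]*(?:[/)-][ ]*)?                 # optional /, ) or - separator
    \d+                                # 1+ digits
    [ ]*(?:[/)-][ ]*)?                 # optional /, ) or - separator
    \d+                                # 1+ digits
    (?:[ ]*-[ ]*\d+)?                  # optional trailing - and digits
""", re.VERBOSE)

# Search inside surrounding text, or check that a whole string is a number.
found = PHONE_RE.findall("x 0211/409 2268 y")
is_phone = PHONE_RE.fullmatch("+49 (5 41) 9 98 -2268") is not None
print(found, is_phone)
```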