I am using HtmlAgilityPack to parse web pages. Once the document is loaded, I want to extract possible phone numbers from the HTML. Currently, I am using a regex for this purpose. The following code checks the page text for phone number matches:
private static string phoneReg =
@"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);
var phoneMatches = phoneRegex.Matches(doci.DocumentNode.InnerText);
where doci is the HtmlDocument abstraction from HTML Agility Pack. The problem is that it fails to match some phone numbers, such as 08450 211 211 and +44 (0) 1246 733 000.
Is there a generic regular expression that is suitable for crawling websites and matches most forms of international phone numbers?
You cannot match those phone numbers (08450 211 211 and +44 (0) 1246 733 000) because none of the alternatives in your regex allows spaces between the digit groups.
The first thing you have to do when writing a regular expression is to identify the pattern you want to match.
So, my suggestion is to write down a list of the different phone number formats you want to support and update your question; then we will be able to help you. Otherwise, I can always come up with a new phone number that your regex does not match, or the regex will match more than what you want.
Here is a regex that will match the above phone numbers:
(?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}
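As a rough check, the pattern above could be exercised in C# along these lines. This is only a sketch: the class name `PhonePatternDemo`, the helper `FindPhoneNumbers`, and the sample text are made up for illustration, and in the real crawler the input would be `doci.DocumentNode.InnerText`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhonePatternDemo
{
    // Pattern from the answer: an optional "+CC (N) " prefix,
    // then 4-5 digits followed by two groups of 3 digits.
    private static readonly Regex PhoneRegex =
        new Regex(@"(?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}");

    // Hypothetical helper: returns every substring the pattern matches.
    public static List<string> FindPhoneNumbers(string text)
    {
        var found = new List<string>();
        foreach (Match m in PhoneRegex.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Sample text standing in for doci.DocumentNode.InnerText (assumption).
        string text = "Call 08450 211 211 or +44 (0) 1246 733 000 today.";

        foreach (string phone in FindPhoneNumbers(text))
        {
            Console.WriteLine(phone);
        }
    }
}
```

With this sample text, both numbers from the question should be printed, one per line. Any number in a different layout (for example, with hyphens) would still be missed, which is why the list of formats matters.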
Edit:
According to your comment, I would just use this broader regex, and then discard the matches that are not phone numbers:
(?:\+\d+\s+\(\d+\)\s+)?[\d -]+
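The "match broadly, then filter" idea could look roughly like the sketch below. The filtering rule is an assumption, not part of the original answer: a candidate is kept only if it contains between 7 and 15 digits (15 is the E.164 maximum; the lower bound of 7 is an arbitrary choice for illustration). `PhoneCandidateFilter` and `ExtractCandidates` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhoneCandidateFilter
{
    // Broad pattern from the answer: an optional "+CC (N) " prefix,
    // then any run of digits, spaces, and hyphens.
    private static readonly Regex CandidateRegex =
        new Regex(@"(?:\+\d+\s+\(\d+\)\s+)?[\d -]+");

    // Hypothetical helper: keeps only candidates with a plausible digit count.
    public static List<string> ExtractCandidates(string text)
    {
        var result = new List<string>();
        foreach (Match m in CandidateRegex.Matches(text))
        {
            // The pattern also matches lone spaces and short numbers, so trim and count digits.
            string candidate = m.Value.Trim();
            int digitCount = candidate.Count(char.IsDigit);

            // Assumed bounds: 7 to 15 digits.
            if (digitCount >= 7 && digitCount <= 15)
            {
                result.Add(candidate);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Sample text standing in for doci.DocumentNode.InnerText (assumption).
        string text = "Order 42 shipped. Call 08450 211 211 or +44 (0) 1246 733 000.";

        foreach (string phone in ExtractCandidates(text))
        {
            Console.WriteLine(phone);
        }
    }
}
```

Here the short number "42" and the stray whitespace matches are dropped by the digit-count check, while the two phone numbers are kept. A digit count alone will still let through things like long order numbers or dates written with hyphens, so further checks may be needed depending on the pages being crawled.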