c# regex html-agility-pack phone-number

Parse International Phone numbers from web pages


I am using HtmlAgilityPack to parse web pages. Once the document is loaded, I want to extract possible phone numbers from the HTML. Currently, I am using a regex for this. The following code looks for phone number matches in the page:

    private static string phoneReg =
        @"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
    private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);

    var phoneMatches = phoneRegex.Matches(doci.DocumentNode.InnerText);

Here doci is the HtmlDocument from HtmlAgilityPack. The problem is that the regex fails to match some phone numbers, such as 08450 211 211 and +44 (0) 1246 733 000.

Is there a generic regular expression that is suitable for crawling websites and matches most forms of international phone numbers?


Solution

  • Your regex cannot match those phone numbers (08450 211 211 and +44 (0) 1246 733 000) because none of its alternatives allows spaces between the digit groups.

    The first thing you have to do when writing a regular expression is to identify the pattern you want to match.

    So, my suggestion is to write down a list of the different phone number formats you need, update your question, and then we will be able to help you. Otherwise, I can always come up with a new phone number that your regex does not match, or the regex will match more than what you want.

    Here is a regex that will match the above phone numbers:

    (?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}
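
    As a minimal sketch of how this pattern could be applied, the snippet below runs it over a plain string. The class name PhoneRegexDemo, the method FindPhones, and the sample text are made up for illustration; in the question's code the input would be doci.DocumentNode.InnerText instead.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneRegexDemo
{
    // The pattern from the answer, unchanged.
    private static readonly Regex Phone = new Regex(
        @"(?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}");

    // Hypothetical helper: returns every match found in the given text.
    public static List<string> FindPhones(string text)
    {
        var found = new List<string>();
        foreach (Match m in Phone.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Sample text standing in for doci.DocumentNode.InnerText (assumption).
        string text = "Call 08450 211 211 or +44 (0) 1246 733 000 today.";

        foreach (string phone in FindPhones(text))
        {
            Console.WriteLine(phone);
        }
    }
}
```

    On the sample text this finds the two numbers from the question. Any number written in a different layout, for example with hyphens, would still be missed.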
    

    Edit:

    According to your comment, I would just use this regex, and then remove the ones that are not phone numbers:

    (?:\+\d+\s+\(\d+\)\s+)?[\d -]+
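
    The "remove the ones that are not phone numbers" step could look roughly like the sketch below. It is only one possible filter: the digit-count bounds (7 to 15, loosely based on the E.164 limit of 15 digits) are an assumption and not part of the answer, and the names PhoneCandidateFilter and Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhoneCandidateFilter
{
    // The broad pattern from the answer, unchanged. It also matches
    // lone spaces, years, and other short digit runs.
    private static readonly Regex Candidate = new Regex(
        @"(?:\+\d+\s+\(\d+\)\s+)?[\d -]+");

    // Assumed bounds for a plausible phone number (not from the answer).
    private const int MinDigits = 7;
    private const int MaxDigits = 15;

    public static List<string> Extract(string text)
    {
        var results = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            // The pattern can pick up surrounding spaces, so trim them.
            string value = m.Value.Trim();

            // Keep only candidates whose digit count looks like a phone number.
            int digits = value.Count(char.IsDigit);
            if (digits >= MinDigits && digits <= MaxDigits)
            {
                results.Add(value);
            }
        }
        return results;
    }
}
```

    A digit-count check like this still lets through things such as year ranges ("2004 - 2010" has eight digits), so a real crawler would likely need stricter rules on top of it.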