I'm writing a program that scrapes blog posts from a number of web sites. I'm trying to extract Australian-formatted phone numbers from the free text of each post. This has proven to be fairly difficult.
Here are a few constructed blog post examples:
Example 1:
"Hello, my name is Alicia I'm 32 and have lived in Brisbane for the past 40 years. I'm 6" tall and an agile runner. Since 2004 I have been running for 2-3 times per week. Feel free to call +61 (04) 654 456 or try my other number 0434 43 22 34."
From this blog post I need to extract "04654456" and "0434432234".
Example 2:
"I'm Joe and also love running. Standing 7" feet tall and have been going at it since 2004. For training advice pls call 043 572-6087 or (02) 1232 23 56."
From this blog post I need to extract "0435726087" and "0212322356".
Example 3:
"My name is Pricilla and I love running. You can reach me on 0 434 45 45 12, but don't call before 12 am pls (I got clients up until 10-11-ish). My license number is 4335TE33 and I drive a 2004 Ford Bronco with brand new 6" tires. I can run 28 km, but usually require a break every 3 or 4 km. Call me today (04) 3 445 4512"
From this blog post I need to extract "0434454512".
I have come up with quite an elaborate system that for each blog entry does the following:
1) Strip away all non-numeric characters, trim the result and collapse double spaces
2) Convert the string to an array, so we are left with an array of digit groups, e.g. ['0', '434', '45', '45', '12', '4335', '33', '2004', '6', '28', '3', '4', '04', '34', '832', '234']
3) Iterate through the array of digit groups and apply rules to piece the numbers together. This code is bloated and not very pretty.
4) Validate the result using a RegExp pattern for Australian mobile and landline numbers
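The four steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual implementation: the function names, the joining rule in `piece_together` and the `AU_NUMBER` pattern are all assumptions made up for the example, and the pattern is deliberately simplified.

```python
import re

# Assumed validation pattern (step 4): a leading 0, an area or mobile
# digit (2, 3, 4, 7 or 8), then eight more digits. Simplified on purpose.
AU_NUMBER = re.compile(r"0[23478]\d{8}")


def digit_groups(text):
    """Steps 1-2: discard everything except digits, keeping each run as a string."""
    return re.findall(r"\d+", text)


def piece_together(groups, length=10):
    """Step 3 (hypothetical rule): starting at any group that begins with '0',
    append the following groups until `length` digits are reached, then validate."""
    found = []
    i = 0
    while i < len(groups):
        if groups[i].startswith("0"):
            candidate = ""
            j = i
            while j < len(groups) and len(candidate) < length:
                candidate += groups[j]
                j += 1
            if AU_NUMBER.fullmatch(candidate):
                if candidate not in found:  # the same number may appear twice
                    found.append(candidate)
                i = j  # skip the groups that were consumed
                continue
        i += 1
    return found


post = "For training advice pls call 043 572-6087 or (02) 1232 23 56."
print(piece_together(digit_groups(post)))
# → ['0435726087', '0212322356']
```

Note that a fixed 10-digit rule like this one would not return the 8-digit "04654456" wanted in Example 1, and because all surrounding text is thrown away it can also glue together digits that were never meant to belong to one number.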
Obviously I have tried with regular expressions, but they fail big time in this case.
My system works most of the time, but the code is not pretty to say the least.
How would you attack this?
What you are looking for is actually a research area in Natural Language Processing known as entity extraction. There are many approaches to the problem and several mathematical models for solving such tasks. Fortunately, there are toolkits available that do similar tasks; OpenNLP and Stanford NER are a couple of examples. They have tools to automatically extract names, dates, parts of speech, etc. You might be able to adapt them to extract phone numbers. One thing to know is that these are statistical models (as opposed to rule-based, which is your current approach), so you would need training data.
Note that this might require significant changes to what you are currently doing, so it may or may not be worth it. But if you are going to keep working on entity extraction from unstructured text, these tools are worth knowing about.
I would start by looking into OpenNLP/Stanford documentation to see if what you are looking for is possible.
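To give a rough idea of what "training data" means for a statistical tagger: you annotate where the phone numbers are in some sample posts, and each token gets a label. The sketch below is only an illustration in Python, assuming a whitespace tokenizer and the common B/I/O labelling scheme. The function name `bio_labels` and the `PHONE` label are made up for the example, and each toolkit expects its own file format, so check its documentation before preparing real data.

```python
import re


def bio_labels(text, phone_spans):
    """Tag each whitespace-separated token as B-PHONE, I-PHONE or O.

    phone_spans is a list of hand-annotated (start, end) character offsets,
    one pair per phone number in the text.
    """
    labelled = []
    for match in re.finditer(r"\S+", text):
        label = "O"  # outside any phone number by default
        for start, end in phone_spans:
            if match.start() >= start and match.end() <= end:
                # First token of the span is B-, later tokens are I-
                label = "B-PHONE" if match.start() == start else "I-PHONE"
                break
        labelled.append((match.group(), label))
    return labelled


text = "For training advice pls call 043 572-6087 or email me"
start = text.index("043")
spans = [(start, start + len("043 572-6087"))]

for token, label in bio_labels(text, spans):
    print(token, label)
```

A model trained on enough examples like this learns from the surrounding words ("call", "reach me on") as well as from the digits themselves, which is the information a digits-only approach discards.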