Tags: algorithm, screen-scraping, nlp, pattern-matching, named-entity-recognition

Algorithms for recognizing physical addresses on a web page


What are the best algorithms for recognizing structured data on an HTML page?

For example, Google recognizes a home or company address in an email and offers a map to that address.


Solution

  • A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, using a gazetteer of known places to help resolve common ambiguities. Unless the pages were machine-generated from a common source, you're going to find regular expressions on their own a bit weak for the job.
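
To illustrate the gazetteer idea, here is a minimal Python sketch. It is not GATE's API and not a production approach: the tiny `GAZETTEER` set, the `STREET_TYPES` list, the `find_addresses` function, and its `window` parameter are all hypothetical names chosen for this example. It assumes the HTML has already been reduced to plain text. A loose regular expression proposes street-like candidates, and a candidate is kept only if a known place name from the gazetteer follows shortly after it.

```python
import re

# Hypothetical toy gazetteer; a real one would hold many thousands of places.
GAZETTEER = {"springfield", "portland", "cambridge"}

# Assumed, incomplete list of street-type words.
STREET_TYPES = r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)"

# Candidate pattern: house number, one to three capitalized words, street type.
CANDIDATE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}" + STREET_TYPES + r"\b"
)


def find_addresses(text, window=40):
    """Return (street, place) pairs confirmed by a nearby gazetteer entry."""
    results = []
    for match in CANDIDATE.finditer(text):
        # Look only at the short stretch of text right after the candidate.
        tail = text[match.end():match.end() + window]
        for word in re.findall(r"[A-Za-z]+", tail):
            if word.lower() in GAZETTEER:
                results.append((match.group(0), word))
                break  # first known place is enough to confirm the candidate
    return results


if __name__ == "__main__":
    sample = "Our office is at 221 Baker Street, Cambridge. Call 555 0199."
    print(find_addresses(sample))
```

The regular expression alone would also accept something like "5 Oak Road tomorrow"; the gazetteer check is what filters such candidates out, which is the kind of help a list of known places can give.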