php html parsing html-content-extraction

How to extract data from a raw HTML file?


Is there a way to extract specific data from raw HTML that was written non-semantically, with no IDs or classes? Suppose I have a saved HTML file of a web page (a profile) and I want to extract a field such as 'hobbies'. Is it possible to do this using PHP?


Solution

  • Use regex! I kid, I kid. If you know the structure of the page, and the format is guaranteed to remain similar enough, then you can try writing a manual parser. Alternatively, there are a lot of libraries out there that will parse HTML for you. I'm not familiar enough with PHP to recommend one, but I'm sure some Googling could take you a long way. I've had luck with John Resig's pure JavaScript HTML parser before.

    At the end of the day, if you need semantic information from an HTML page that isn't constructed semantically, you're probably doomed programmatically, and your best bet may be a mechanical turk (that is, having people do the extraction by hand).
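
For the case where the page layout is known and stable, here is a minimal sketch of the library route in PHP, using the built-in DOMDocument and DOMXPath classes. Since there are no IDs or classes to hook into, it anchors on the visible label text instead. The markup, the `extractField` helper, and the assumption that the value sits in the table cell right after the label are all hypothetical; the real profile page will need its own XPath query.

```php
<?php
// Hypothetical markup standing in for the saved profile page.
$html = <<<HTML
<html><body>
<table>
  <tr><td><b>Name</b></td><td>Jane Doe</td></tr>
  <tr><td><b>Hobbies</b></td><td>Chess, hiking</td></tr>
</table>
</body></html>
HTML;

// Hypothetical helper: returns the text of the cell that follows the
// cell whose text equals $label, or null if no such cell exists.
// Assumes $label contains no double quotes (it is pasted into the XPath).
function extractField(string $html, string $label): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world markup is often invalid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the label cell by its visible text, then step to the next cell.
    $query = sprintf(
        '//td[normalize-space(.)="%s"]/following-sibling::td[1]',
        $label
    );
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

echo extractField($html, 'Hobbies'), PHP_EOL; // prints "Chess, hiking"
```

To run this against a saved file, the string could come from `file_get_contents()` on that file instead of the inline sample. This only works as long as the label text and its position relative to the value stay the same, which is exactly the fragility described above.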