Tags: java, html, html-content-extraction

Extracting Information from websites


Not every website exposes its data in a convenient form such as an XML feed or an API.

How could I go about extracting information from a website? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

I come from a background of Java programming and coding with Apache XMLBeans. Is there anything similar for parsing HTML, when I know the structure of the page and the data sits inside a known tag?

Thanks


Solution

  • There are several open source HTML parsers available for Java.

    I have used JTidy in the past and had good luck with it. It gives you a DOM of the HTML page, and you should be able to grab the tags you need from there.
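
As a rough illustration of the "grab the tags you need from the DOM" step, here is a minimal sketch that uses only the JDK. It assumes the page has already been cleaned into well-formed markup, which is the job a tool like JTidy does (I believe its Tidy.parseDOM(InputStream, OutputStream) method returns an org.w3c.dom.Document, but check the JTidy docs). The class name SpanExtractor and the helpers parse and textById are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of JTidy or any other library.
public class SpanExtractor {

    // Builds a W3C DOM from markup that is ASSUMED to be well-formed already.
    // Real-world HTML usually is not, which is why a cleaner such as JTidy
    // would normally produce the Document instead of this method.
    public static Document parse(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Returns the text of the first element whose id attribute equals the
    // given value, or an empty string when nothing matches.
    // Assumes the id contains no single quote, since it is spliced into
    // the XPath expression as a literal.
    public static String textById(Document dom, String id) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//*[@id='" + id + "']", dom);
    }

    public static void main(String[] args) throws Exception {
        String page =
                "<html><body><div><div>"
                + "<span id=\"important-data\">information here</span>"
                + "</div></div></body></html>";
        // prints "information here"
        System.out.println(textById(parse(page), "important-data"));
    }
}
```

XPath is used here rather than Document.getElementById because the latter only works when the parser knows which attribute is of type ID (normally via a DTD), which is often not the case for scraped pages.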