java web-crawler htmlunit

Using HtmlUnit as a crawler


I needed a headless browser to parse pages. HtmlUnit allowed me to set up a Heroku Java app to fulfill this purpose.

But now I'm running into a couple of issues.

The current one is a malformed URL: "//path" instead of "/path" or "http(s)://path". I downloaded the sources of version 2.9.4 and patched them with a few tiny fixes, but modifying the standard sources is not really a good approach, for obvious maintainability reasons.
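To illustrate the kind of URL involved, here is a minimal sketch that normalizes a link before it is handed to the crawler, using only `java.net.URI` from the JDK. It assumes that "//path" is meant as a scheme-relative reference (same scheme as the current page, host taken from the link); the class name `UrlResolver`, the method `resolve`, and the example.com URLs are made up for the example and are not part of HtmlUnit.

```java
import java.net.URI;

// Hypothetical helper, not part of HtmlUnit: resolves an href found on a page
// against the URL of that page, using the standard java.net.URI rules.
final class UrlResolver {
    private UrlResolver() {}

    // pageUrl: absolute URL of the page the link was found on.
    // href: the raw link, e.g. "//host/x", "/path", "x.html" or an absolute URL.
    // URI.create throws IllegalArgumentException for hrefs that are not valid
    // URIs (for example ones containing spaces), so very dirty pages may need
    // extra cleanup before this call.
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        return base.resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/shop/index.html";
        System.out.println(resolve(page, "//cdn.example.com/lib.js")); // scheme-relative
        System.out.println(resolve(page, "/path"));                    // host-relative
        System.out.println(resolve(page, "http://other.example.org/x")); // already absolute
    }
}
```

With this approach the scheme-relative link inherits "https" from the page it was found on, so the fix can live in the crawler code instead of in patched library sources.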

So I'm wondering whether I'm digging in the wrong direction. HtmlUnit is designed to browse pages for testing purposes. My goal is to behave like a real browser, that is, to make pages work as well as possible, especially because my damned target websites are the ultra-dirty, not-respecting-anything kind...

What is your opinion on this?


Solution

  • HtmlUnit is used in Selenium 2/WebDriver for headless browser "simulation", and it works fine there.

    So I see no reason not to try HtmlUnit. Maybe you could have a look at Selenium 2/WebDriver too.