Using HtmlUnit as a crawler

I needed a headless browser to parse pages. HtmlUnit allowed me to set up a Heroku Java app to fulfill this purpose.

But now I'm running into a couple of issues.

The current one is URLs of the form "//path" instead of "/path" or "http(s)://path", which HtmlUnit treats as malformed. I downloaded the sources of version 2.9.4 and patched them with a few tiny fixes, but modifying the library's own sources is not a good approach, for obvious maintainability reasons.
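For what it's worth, a link starting with "//" is a scheme-relative reference (RFC 3986 calls it a network-path reference), and it can be resolved against the page's own URL outside HtmlUnit instead of patching the library. Below is a minimal sketch using only `java.net.URI` from the JDK. The class name `UrlResolver`, the method `resolve`, and the example hosts are made up for illustration; this is not part of HtmlUnit.

```java
import java.net.URI;

// Hypothetical helper, not part of HtmlUnit: resolves an href found in a page
// against the URL of the page it was found on.
final class UrlResolver {

    // pageUrl must be absolute; href may be absolute, "/path", "//host/path",
    // or relative. URI.create throws IllegalArgumentException if either string
    // contains illegal characters (e.g. unescaped spaces).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "https://site.example/a/b.html"; // assumed example URL
        System.out.println(resolve(page, "//cdn.example/x.js"));     // https://cdn.example/x.js
        System.out.println(resolve(page, "/path"));                  // https://site.example/path
        System.out.println(resolve(page, "http://other.example/p")); // http://other.example/p
    }
}
```

The scheme-relative link inherits the scheme of the page it was found on, which is what a browser does. Really dirty hrefs with illegal characters would still need escaping before being passed in.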

So I'm wondering whether I'm digging in the wrong direction. HtmlUnit is designed to browse pages for testing purposes. My goal is to behave like a real browser, that is, to make pages work as well as possible, especially because the websites I'm targeting are the ultra-dirty kind that don't respect any standard...

What is your opinion on this?

Solution

HtmlUnit is used in Selenium 2/WebDriver for headless browser "simulation". There it works fine.

So I see no reason not to try HtmlUnit. You may want to have a look at Selenium 2/WebDriver too.