Tags: java, synchronization, htmlunit

HtmlUnit Synchronization Questions


I am using HtmlUnit in one of my web projects to screen scrape some code. I am wondering to what extent I need to synchronize the code. Currently I am synchronizing all code where I'm using the WebClient object to retrieve pages (i.e. webClient.getPage(url)). I assume that if webClient.getPage() is not synchronized, then the 'browser' could possibly try to load multiple pages at once (correct me if I'm wrong). To get around this, I'd probably have to open multiple windows, correct?

My question concerns the HtmlPage, HtmlTable, etc. classes. After I retrieve an HtmlPage object, do I need to synchronize reads of that page and of the objects it returns (e.g. HtmlTable), or is the whole page cached in memory? I assume that if it isn't cached, bad things could happen when the WebClient calls getPage() again while I'm still manipulating the previously returned HtmlPage object.

I'd like to have a connection class whose synchronized methods guard every call to the WebClient and return an HtmlPage, so that callers can then manipulate the page without having to worry about synchronization. Are there any issues with this?

Example:

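import java.io.IOException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class MyConnection {
    private final WebClient webClient;

    public MyConnection() {
        this.webClient = new WebClient();
        this.webClient.setTimeout(10 * 1000);
        this.webClient.setJavaScriptEnabled(false);
        this.webClient.setCssEnabled(false);
    }

    // Serializes access to the shared WebClient; getPage() throws checked I/O exceptions.
    public synchronized HtmlPage getHtmlPage(String url) throws IOException {
        return webClient.getPage(url);
    }
}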



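public class UseConnectionClass {
    // Initialized here so getAPage() does not dereference null.
    private final MyConnection conn = new MyConnection();

    // Delegates to the synchronized accessor on MyConnection.
    public HtmlPage getAPage(String url) throws IOException {
        return conn.getHtmlPage(url);
    }
}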

public class ClientClass {
    public void doSomething() throws IOException {
        UseConnectionClass useConn = new UseConnectionClass();
        HtmlPage page1 = useConn.getAPage("http://foobar1.com/");
        HtmlPage page2 = useConn.getAPage("http://foobar2.com/");
        // do something with page1...
        // do something with page2...

        page1.getElementsByTagName("table");
        page2.getElementsByTagName("table");

        // etc...
    }
}

EDIT: I know that WebClient is not thread-safe, which is why the getHtmlPage() method of MyConnection in my example is synchronized.


Solution

  • As the javadoc says:

    a WebClient instance is not thread safe. It is intended to be used from a single thread.

    Each thread should have its own WebClient.
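One way to follow the "one WebClient per thread" advice is to hand out clients through a ThreadLocal. The following is only a minimal sketch under stated assumptions: StubClient, PerThreadClient and PerThreadDemo are hypothetical names introduced for illustration, and StubClient stands in for WebClient so the example runs without HtmlUnit on the classpath. In real code the factory passed to PerThreadClient would presumably be WebClient::new.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical stand-in for a non-thread-safe client such as WebClient.
final class StubClient {
    private final Thread owner = Thread.currentThread();

    String fetch(String url) {
        // Fail loudly if the client leaks to a thread other than its creator.
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("client used from a foreign thread");
        }
        return "page for " + url;
    }
}

// Gives each calling thread its own lazily created client instance.
final class PerThreadClient<T> {
    private final ThreadLocal<T> clients;

    PerThreadClient(Supplier<T> factory) {
        this.clients = ThreadLocal.withInitial(factory);
    }

    T get() {
        return clients.get();
    }
}

final class PerThreadDemo {
    // Runs `tasks` fetches on a pool of `threads` workers and returns how many
    // distinct client instances were used (never more than `threads`).
    static int distinctClientsUsed(int threads, int tasks) {
        PerThreadClient<StubClient> clients = new PerThreadClient<>(StubClient::new);
        Set<StubClient> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            final String url = "http://example.invalid/" + i;
            pool.submit(() -> {
                StubClient client = clients.get(); // this thread's own instance
                client.fetch(url);
                seen.add(client);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return seen.size();
    }
}
```

With this arrangement no synchronized wrapper is needed around the fetch call, because no client is ever shared between threads. The trade-off is that each worker thread keeps its own client alive, so a long-lived pool should close those clients when the workers are retired.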