
What is the optimal way for reading the contents of a webpage into a string in Java?


I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.

public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }

    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }

    return page.toString();
}

I can't help but feel that the line reading is less than optimal. I know that I'm possibly masking a MalformedURLException thrown by the URL constructor (it is a subclass of IOException), and I'm okay with that.

My function also has the side effect of rewriting the line terminators in the HTML String to match the current system. This isn't a requirement.

I realize that network IO will probably dwarf the time it takes to read in the HTML, but I'd still like to know whether this is optimal.

On a side note: It would be awesome if StringBuilder had a constructor for an open InputStream that would simply take all the contents of the InputStream and read it into the StringBuilder.


Solution

  • As seen in the other answers, there are many different edge cases (HTTP peculiarities, encoding, chunking, etc.) that any robust solution has to account for. Therefore I propose that in anything other than a toy program you use the de facto standard Java HTTP library: Apache HttpComponents HttpClient.

    The project provides many samples. "Just" getting the response contents for a request looks like this:

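    // HttpClient 4.x API; DefaultHttpClient is deprecated as of 4.3
    HttpClient httpclient = new DefaultHttpClient();
    try {
        HttpGet httpget = new HttpGet("http://www.google.com/");
        // BasicResponseHandler returns the body as a String, or throws
        // HttpResponseException for status codes of 300 or above
        ResponseHandler<String> responseHandler = new BasicResponseHandler();
        String responseBody = httpclient.execute(httpget, responseHandler);
        // responseBody now contains the contents of the page
        System.out.println(responseBody);
    } finally {
        // release the connections even if the request fails
        httpclient.getConnectionManager().shutdown();
    }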
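
For the toy-program case, where pulling in a library is not worth it, a JDK-only version of the read loop could look roughly like the sketch below. It reads fixed-size character chunks instead of lines, so the page's original line terminators are kept as they are, and it takes the charset as an argument rather than relying on the platform default. The names StreamText and readAll are hypothetical, and passing the charset in is an assumption: for a real page the caller would have to work it out, for example from the Content-Type response header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamText {

    // Hypothetical helper: drains the stream into a String, decoding with the given charset.
    // The stream is closed when the method returns.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[8192];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            // read() returns the number of chars placed in the buffer, or -1 at end of stream
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for conn.getInputStream() so the sketch runs offline.
        byte[] bytes = "<html>\r\n<body>hi</body>\r\n</html>".getBytes(StandardCharsets.UTF_8);
        String page = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        System.out.println(page);
    }
}
```

With the code from the question, the call would be along the lines of readAll(conn.getInputStream(), StandardCharsets.UTF_8), with UTF-8 replaced by whatever charset the server actually declares. This still ignores everything else HttpClient takes care of (redirect policy, error statuses, timeouts), so it is only meant for the simple case.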