I have the following Java code to fetch the entire contents of an HTML page at a given URL. Can this be done in a more efficient way? Any improvements are welcome.
public static String getHTML(final String url) throws IOException {
    if (url == null || url.length() == 0) {
        throw new IllegalArgumentException("url cannot be null or empty");
    }
    final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    final BufferedReader buf = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    final StringBuilder page = new StringBuilder();
    final String lineEnd = System.getProperty("line.separator");
    String line;
    try {
        while (true) {
            line = buf.readLine();
            if (line == null) {
                break;
            }
            page.append(line).append(lineEnd);
        }
    } finally {
        buf.close();
    }
    return page.toString();
}
I can't help but feel that the line reading is less than optimal. I know that I'm possibly masking a MalformedURLException caused by the openConnection call, and I'm okay with that.
My function also has the side-effect of making the HTML String have the correct line terminators for the current system. This isn't a requirement.
I realize that network IO will probably dwarf the time it takes to read in the HTML, but I'd still like to know whether this is optimal.
On a side note: It would be awesome if StringBuilder had a constructor for an open InputStream that would simply take all the contents of the InputStream and read it into the StringBuilder.
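For what it's worth, the StringBuilder wish above can be approximated with a small helper. This is only a minimal sketch: the class name StreamText and the method readAll are made up for illustration, the 8192-character buffer is an arbitrary choice, and the caller has to know (or guess) the page's charset. It reads fixed-size chunks instead of lines, so the original line terminators are kept as-is.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper; the class and method names are illustrative only.
final class StreamText {
    private StreamText() {}

    /** Reads the whole stream into a String, decoding it with the given charset. */
    static String readAll(final InputStream in, final Charset charset) throws IOException {
        final StringBuilder out = new StringBuilder();
        final char[] chunk = new char[8192]; // buffer size is an arbitrary choice
        // try-with-resources closes the reader (and the wrapped stream) even on failure
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                // append only the characters actually read in this pass
                out.append(chunk, 0, n);
            }
        }
        return out.toString();
    }
}
```

Inside getHTML this could be called as StreamText.readAll(conn.getInputStream(), StandardCharsets.UTF_8), assuming the page really is UTF-8.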
As seen in the other answers, there are many different edge cases (HTTP peculiarities, encoding, chunking, etc.) that should be accounted for in any robust solution. Therefore I propose that in anything other than a toy program you use the de facto Java standard HTTP library: Apache HttpComponents HttpClient.
They provide many samples; "just" getting the response contents for a request looks like this:
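// HttpClient 4.x API; DefaultHttpClient is deprecated as of 4.3
HttpClient httpclient = new DefaultHttpClient();
try {
    HttpGet httpget = new HttpGet("http://www.google.com/");
    // BasicResponseHandler returns the body as a String for 2xx responses
    ResponseHandler<String> responseHandler = new BasicResponseHandler();
    String responseBody = httpclient.execute(httpget, responseHandler);
    // responseBody now contains the contents of the page
    System.out.println(responseBody);
} finally {
    // release the connections even if the request fails
    httpclient.getConnectionManager().shutdown();
}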
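To make one of the edge cases mentioned above (encoding) a bit more concrete: if you do stay with plain HttpURLConnection, the charset should come from the response's Content-Type header rather than from the platform default. The following is a rough sketch only; the class ContentTypeCharset and its method fromContentType are hypothetical names, and it ignores charsets declared inside the HTML itself (for example in a meta tag).

```java
import java.nio.charset.Charset;

// Hypothetical helper; names are illustrative and not part of any library.
final class ContentTypeCharset {
    private ContentTypeCharset() {}

    /**
     * Extracts the charset parameter from a Content-Type header value such as
     * "text/html; charset=ISO-8859-1". Returns the fallback when the header is
     * missing, has no charset parameter, or names an unknown charset.
     */
    static Charset fromContentType(final String contentType, final Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        final String key = "charset=";
        for (final String part : contentType.split(";")) {
            final String param = part.trim();
            // parameter names are case-insensitive, so compare ignoring case
            if (param.regionMatches(true, 0, key, 0, key.length())) {
                String name = param.substring(key.length()).trim();
                // the value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With a URLConnection named conn, this could be used as ContentTypeCharset.fromContentType(conn.getContentType(), StandardCharsets.UTF_8), where the UTF-8 fallback is an assumption. A library such as HttpClient spares you from writing this kind of header handling yourself.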