java html url character-encoding bufferedreader

Using Java BufferedReader to get HTML from a URL


I'm trying to read all the HTML from a page using a BufferedReader, as follows:

    String charset = "UTF-8";
    URLConnection connection = new URL(url).openConnection();
    connection.addRequestProperty("User-Agent",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    connection.setRequestProperty("Accept-Charset", charset);
    InputStream response = connection.getInputStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(response, charset));

Then I'm reading it line by line like this:

    String data = br.readLine();
    while (data != null) {
        // each line of the response is handled here
        data = br.readLine();
    }

The problem is that I'm getting something like this:

}$B!)(BL$B!)(Bu"~$B!)$(D"C(B|X$B!x!)!x(B}

I've tried this:

    do {
        data = br.readLine();
        if (data == null) {
            break; // end of stream, nothing left to convert
        }
        SortedMap<String, Charset> map = Charset.availableCharsets();
        for (Map.Entry<String, Charset> entry : map.entrySet()) {
            System.out.println(entry.getKey());

            try {
                System.out.println(new String(data.getBytes(entry.getValue())));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    } while (data != null);

and I'm not getting readable HTML with any of them. This is really weird, since it was working fine until this morning and I didn't change anything. What am I doing wrong here? Is it possible that something changed on the website I'm trying to read? Please help.


Solution

  • The server has switched to sending compressed data, as you can see in its response headers:

    Connection:keep-alive
    Content-Encoding:gzip
    Content-Type:text/html; charset=utf-8
    Date:Mon, 09 Mar 2015 09:34:41 GMT
    Server:nginx
    Transfer-Encoding:chunked
    Vary:Accept-Encoding
    X-Powered-By:PHP/5.5.16-pl0-gentoo
    

    As you can see, the content encoding is set to gzip (Content-Encoding:gzip), so you have to decompress the content before decoding it as text:

    GZIPInputStream gzis = new GZIPInputStream(connection.getInputStream());
    BufferedReader br = new BufferedReader(new InputStreamReader(gzis, charset));
    

    To inspect the headers of requests and responses, you can use a network monitor (see Free Network Monitor).

    An easier option is to use the developer tools built into most common browsers. The Chrome DevTools documentation explains how to use the Network tab: https://developer.chrome.com/devtools/docs/network
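
Since the server may switch between compressed and uncompressed responses again, the gzip decoding step above could be made conditional on the Content-Encoding header instead of being hard-coded. The following is only a minimal sketch under that assumption: the class name GzipAwareReader and its methods are hypothetical, and the in-memory gzip data in main stands in for a real server response. With a real connection, the encoding value would likely come from connection.getContentEncoding().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the name and methods are illustrative only.
class GzipAwareReader {

    // Wraps the raw stream in a GZIPInputStream only when the declared
    // Content-Encoding is gzip; otherwise returns the stream unchanged.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Reads the whole body as text, terminating every line with '\n'.
    static String readAll(InputStream raw, String contentEncoding, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(decode(raw, contentEncoding), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Demo helper: gzip-compresses a string in memory to simulate a compressed response.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>hello</body></html>";
        // Compressed body with Content-Encoding: gzip
        System.out.print(readAll(new ByteArrayInputStream(gzip(html)), "gzip", StandardCharsets.UTF_8));
        // Plain body with no Content-Encoding header
        System.out.print(readAll(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)),
                null, StandardCharsets.UTF_8));
    }
}
```

In the question's code this would amount to calling something like GzipAwareReader.readAll(connection.getInputStream(), connection.getContentEncoding(), StandardCharsets.UTF_8), so the same code keeps working whether or not the server compresses the page.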