Tags: java, android, httpsurlconnection

Receiving encoded response to HttpsURLConnection GET request


I am working on an Android app that connects to a web page using the Java class HttpsURLConnection and parses the HTML response with Jsoup. The problem is that the response body comes back looking encoded instead of as readable HTML. What can I do to get the actual HTML?

Here is my code for contacting the website:

private String GetPageContent(String url) throws Exception {

        URL obj = new URL(url);
        conn = (HttpsURLConnection) obj.openConnection();

        // default is GET
        conn.setRequestMethod("GET");

        conn.setUseCaches(false);

        // act like a browser
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "en-US,en;q=0.8,en-GB;q=0.6");
        conn.setRequestProperty("Accept-Encoding" , "gzip, deflate, sdch");
        conn.setRequestProperty("Connection" , "keep-alive");

        if (cookies != null) {
            for (String cookie : this.cookies) {
                // keep only the name=value pair; a limit of 1 would return the whole string
                conn.addRequestProperty("Cookie", cookie.split(";", 2)[0]);
            }
        }
        int responseCode = conn.getResponseCode();
        Log.v(TAG,"\nSending 'GET' request to URL : " + url);
        Log.v(TAG,"Response Code : " + responseCode);

        BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream()));
        String inputLine;
        StringBuffer response = new StringBuffer();

        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
        in.close();

        // Get the response cookies
        setCookies(conn.getHeaderFields().get("Set-Cookie"));

        return response.toString();

    }

And a snippet of the response:

��������������]�r�6��۞�w@ՙ�NDQ�ﱥ|�siv�Kkw�m&�HH�M,  Z��ff_c_o�d�@���9�l�6����� �_=w|����/A{��!W� LZ��������f]�=wc߽�2,˨�|�8x��~�}�x1�$Ib�Uq�7�j�X|;��K

EDIT: The HTML was compressed with gzip, as the headers showed (the request above advertises gzip in its Accept-Encoding header).

The solution to this issue was to use the GZIPInputStream class as shown below:

BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(conn.getInputStream())));

Solution

  • Based on the headers returned with the response, we can conclude that the body is gzip-compressed. Fortunately, the java.util.zip.GZIPInputStream class decompresses such a stream: wrap conn.getInputStream() in it before reading.
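
Wrapping the stream unconditionally only works while the server really does send gzip. Because the request advertises "gzip, deflate, sdch", the server may choose another encoding or none at all. Below is a minimal sketch that wraps the stream only when the Content-Encoding value is gzip. The class and method names (ResponseDecoder, decode, readBody) are hypothetical and not part of the original code, and UTF-8 is an assumed charset; any encoding other than gzip is passed through untouched.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of the original question or answer.
final class ResponseDecoder {

    // Wraps the raw stream in a GZIPInputStream only when the
    // Content-Encoding value says the body is gzip-compressed.
    public static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        // Assumption: anything else (including deflate or sdch) is returned as-is.
        return raw;
    }

    // Reads the (possibly compressed) body into a String.
    // Assumption: the page is UTF-8; the real charset may differ.
    public static String readBody(InputStream raw, String contentEncoding) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                decode(raw, contentEncoding), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // readLine() strips the line terminator, so put one back
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}
```

In GetPageContent this could be called as ResponseDecoder.readBody(conn.getInputStream(), conn.getContentEncoding()). As far as I know, Android's HttpURLConnection requests gzip and decompresses it transparently by default, and setting Accept-Encoding yourself turns that off, so simply dropping the Accept-Encoding line from the request may be an alternative worth trying.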