html http utf-8 apache-httpclient-4.x fluent

How to use the Fluent API of Apache HttpClient to read a UTF-8 encoded website?


String html = Request.Get("https://kokos.pl/")
        .execute().returnContent().asString();

System.out.println(html);

What I get on the 12th line of the output is:

<title>Szybkie po??yczki got??wkowe, po??yczki spo??eczno??ciowe - Kokos.pl</title>

while it should be:

<title>Szybkie pożyczki gotówkowe, pożyczki społecznościowe - Kokos.pl</title>

Solution

  • [DEBUG] DefaultClientConnection - Sending request: GET / HTTP/1.1
    [DEBUG] headers - >> GET / HTTP/1.1
    [DEBUG] headers - >> Host: kokos.pl
    [DEBUG] headers - >> Connection: Keep-Alive
    [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2.5 (java 1.5)
    [DEBUG] DefaultClientConnection - Receiving response: HTTP/1.1 200 OK
    [DEBUG] headers - << HTTP/1.1 200 OK
    [DEBUG] headers - << Server: nginx
    [DEBUG] headers - << Date: Thu, 01 Aug 2013 12:04:12 GMT
    [DEBUG] headers - << Content-Type: text/html
    [DEBUG] headers - << Connection: keep-alive
    ...
    

    The response returned by the server for this URI does not specify the charset of the content: the Content-Type header is just text/html, with no charset parameter. In such cases HttpClient falls back to the default charset for HTTP content, which is ISO-8859-1, not UTF-8.

    Unfortunately, the only way to override the default content charset used by the fluent API is to use a custom response handler:

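    import java.io.IOException;

    import org.apache.http.Consts;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ResponseHandler;
    import org.apache.http.client.fluent.Request;
    import org.apache.http.util.EntityUtils;

    // Decode the body as UTF-8 when the response does not declare a charset.
    ResponseHandler<String> myHandler = new ResponseHandler<String>() {
        @Override
        public String handleResponse(
                final HttpResponse response) throws IOException {
            return EntityUtils.toString(response.getEntity(), Consts.UTF_8);
        }
    };

    String html = Request.Get("https://kokos.pl/").execute().handleResponse(myHandler);

    System.out.println(html);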
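
The effect of decoding a UTF-8 body as ISO-8859-1 can be reproduced without any network access. The following is a minimal sketch that uses only the JDK, not HttpClient; the class name CharsetMismatchDemo and the helper decode are made up for illustration. It encodes one word from the page title as UTF-8 and then decodes the same bytes with each charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: turns raw body bytes into text using the given
    // charset, which is the step a client performs after choosing a charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // One word from the page title; the letter U+017C is written as an
        // escape so the encoding of this source file does not matter.
        String original = "po\u017Cyczki";

        // What the server actually sends: the UTF-8 bytes of the text.
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset turns the two-byte letter into
        // two unrelated characters; decoding with UTF-8 restores the text.
        String wrong = decode(body, StandardCharsets.ISO_8859_1);
        String right = decode(body, StandardCharsets.UTF_8);

        System.out.println(original.equals(wrong)); // false
        System.out.println(original.equals(right)); // true
        System.out.println(wrong.length() + " vs " + original.length()); // 9 vs 8
    }
}
```

Each non-ASCII letter occupies two bytes in UTF-8, and ISO-8859-1 maps every byte to its own character, which is why each Polish letter in the title shows up as two wrong characters.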