Java UTF-8 encoding not set to URLConnection


I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47

As you can see, the text contains characters like this: /ælˈdʒɪəriə/.

When I try to get the source of the page, I get text with sequences like &#250; instead.

So far I've tried with the following code:

urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");

What am I doing wrong?

My entire code:

URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) {e.printStackTrace();}

try {
    urlConn = url.openConnection(); 
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");

urlConn.setDoInput(true);
urlConn.setUseCaches(false);

StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);

try {
    input = new DataInputStream(urlConn.getInputStream()); 
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine()))) 
    {
        strB.append(str); 
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }

Solution

  • The HTML page is in UTF-8 and could contain Arabic characters and such. But characters above code point 127 are still sent as numeric entities like &#250;. Setting a request header such as Accept-Charset will not help, and loading the page as UTF-8 is entirely right.

    You have to decode the entities yourself. Something like:

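    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String decodeNumericEntities(String s) {
        StringBuffer sb = new StringBuffer();
        // Match decimal numeric character references such as &#250;
        Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
        while (m.find()) {
            int uc = Integer.parseInt(m.group(1));
            // Copy the text before the match, dropping the entity itself...
            m.appendReplacement(sb, "");
            // ...and append the character it stands for.
            sb.appendCodePoint(uc);
        }
        m.appendTail(sb);
        return sb.toString();
    }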
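
    HTML also allows hexadecimal references such as &#xFA;, which a decimal-only pattern leaves untouched. The following is a minimal sketch of a variant that handles both forms. The class name NumericEntities and the method name decode are hypothetical, and leaving out-of-range references unchanged is an assumption about the desired behaviour, not something the page in the question is known to need.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntities {
    // Group 1 captures hexadecimal digits (&#xFA;), group 2 decimal digits (&#250;).
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Replaces numeric character references in s with the characters they denote. */
    static String decode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        Matcher m = ENTITY.matcher(s);
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous match and this one.
            sb.append(s, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            } else {
                // Assumption: keep references outside the Unicode range as they are.
                sb.append(m.group());
            }
            last = m.end();
        }
        sb.append(s, last, s.length());
        return sb.toString();
    }
}
```

    Calling NumericEntities.decode(strB.toString()) on the downloaded text would then turn both &#250; and &#xFA; into the same character, while named entities such as &amp; stay as they are.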
    

    By the way, those entities probably stem from processed HTML form input, that is, from the editing side of the web app.


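    After seeing the code in the question:

    I have replaced the DataInputStream with a (Buffered)Reader, because the response is text. DataInputStream.readLine() is deprecated precisely because it does not convert bytes to characters properly. InputStreams read binary data (bytes); Readers read text (chars, Strings). An InputStreamReader takes an InputStream and an encoding as parameters and is itself a Reader over the decoded text.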

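    // Declared outside the try block so the text is still available afterwards.
    StringBuilder strB = new StringBuilder();
    // Decode the response bytes as UTF-8 while reading; the reader is closed automatically.
    try (BufferedReader input = new BufferedReader(
            new InputStreamReader(urlConn.getInputStream(), "UTF-8"))) {
        String str;
        while (null != (str = input.readLine())) {
            strB.append(str).append("\r\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }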
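
    To see why the encoding passed to the InputStreamReader matters, here is a small sketch that can be tried without a network connection. The class StreamText and its readAll method are hypothetical helpers, and wrapping the IOException in an UncheckedIOException is just one possible design choice made to keep the example short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class StreamText {
    /** Reads the whole stream as text, decoding its bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader, and with it the underlying stream.
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            // Assumption: callers prefer an unchecked exception here.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

    Feeding the UTF-8 bytes of "æ" through readAll with StandardCharsets.UTF_8 gives back one character, while StandardCharsets.ISO_8859_1 turns the same two bytes into two wrong characters. With urlConn from the question, the call would be StreamText.readAll(urlConn.getInputStream(), StandardCharsets.UTF_8).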