Tags: java, string, url, inputstream, bufferedreader

Read site source: � characters


I'm trying to read the source of a web page, but when it contains characters like ã, á, à, õ, I get � instead.

I've tried applying java.nio.charset.Charset.encode to the lines I read, but nothing changes: the same thing happens.

My code is:

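URLConnection connection = ...;
// BufferedReader wraps a Reader, so the stream needs an InputStreamReader.
// No charset is given here, so the default charset is used to decode the bytes.
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String s = null;

while ((s = reader.readLine()) != null) {
  // got new source line...
}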

The site I'm trying to read is this one (PT-BR).


Solution

  • According to the meta tag, the charset on that page is ISO-8859-1, so the bytes have to be decoded with that charset rather than the default one. Try using:

    Scanner scanner = new Scanner(connection.getInputStream(), "ISO-8859-1");
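
If you would rather keep the BufferedReader loop from the question, the same idea should work by passing the charset to an InputStreamReader. The following is only a minimal sketch: the class name CharsetReadDemo and the helper readLines are made up for illustration, and a ByteArrayInputStream stands in for connection.getInputStream() so the example runs without a network connection. The accented sample text is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CharsetReadDemo {

    // Hypothetical helper: reads all lines, decoding the bytes with the given charset.
    static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream(): "ação" and "não" encoded as ISO-8859-1.
        byte[] page = "a\u00e7\u00e3o\nn\u00e3o".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the page declares keeps the accented characters.
        System.out.println(readLines(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1));

        // Decoding the same bytes as UTF-8 replaces the invalid sequences with U+FFFD.
        System.out.println(readLines(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

This also suggests why calling Charset.encode on the lines afterwards does not help: by the time readLine returns, the wrong decoder has already replaced the original bytes with U+FFFD, so the charset has to be chosen when the stream is first wrapped.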