I'm writing a crawler in Java to crawl some websites, which may contain non-ASCII Unicode characters such as "£". When I store the content (the source HTML) in a Java String, these characters get lost and are replaced by a question mark "?". I'd like to know how to keep them intact. The related code is as follows:
protected String readWebPage(String weburl) throws IOException{
HttpClient httpclient = new DefaultHttpClient();
HttpGet httpget = new HttpGet(weburl);
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = httpclient.execute(httpget, responseHandler);
// responseBody now contains the contents of the page
httpclient.getConnectionManager().shutdown();
return responseBody;
}
// function call
String res = readWebPage(url);
PrintWriter out = new PrintWriter(outDir+name+".html");
out.println(res);
out.close();
And later when doing character matches, I also want to be able to do something like:
if(text.indexOf("£")>=0)
I don't know whether Java will recognize that character and match it the way I expect.
Any input will be greatly appreciated. Thanks in advance.
Your non-ASCII characters are getting lost either on input to Java or on output.
Java works with Unicode strings internally, so you have to tell it how to decode input and encode output.
Let's assume that HttpClient is correctly interpreting the response from the remote server and is decoding the response correctly.
Next up, you have to ensure that you encode the contents correctly when you write them to disk. If you don't specify an encoding, PrintWriter falls back to the platform default charset, which Java derives from your local environment and which may not be able to represent every character. To force the encoding, pass the charset name to PrintWriter:
PrintWriter out = new PrintWriter(outDir+name+".html", "UTF-8");
Then open the resulting .html file in a text editor, such as Notepad++, running in UTF-8 mode to check that you can still see the non-ASCII chars.
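As a way to check the output side without relying on an editor, here is a minimal sketch that writes a string as UTF-8 and reads it straight back. The class name Utf8RoundTrip and its two helper methods are made up for illustration; only the PrintWriter constructor taking a charset name comes from the advice above.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original crawler.
class Utf8RoundTrip {
    // Writes text to the file as UTF-8, regardless of the platform default charset.
    static void writeUtf8(Path file, String text) throws IOException {
        try (PrintWriter out = new PrintWriter(file.toFile(), "UTF-8")) {
            out.print(text);
        }
    }

    // Reads the file back, decoding its bytes as UTF-8.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

If the string returned by readUtf8 still contains "£" after a write, the output step is fine and any remaining loss is happening on input.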
If you can't, then you'll need to turn your attention to the input side: HttpClient. See the answer "Set response encoding with HttpClient 3.1" for clues if your remote server is lying about the character encoding.
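To see what a wrong charset does on either side, the following sketch uses only the standard library rather than HttpClient itself. The class CharsetDemo and its methods are hypothetical names; the byte values are simply the UTF-8 encoding of "£".

```java
import java.nio.charset.Charset;

// Hypothetical demo class; it stands in for the decode and encode steps of a crawler.
class CharsetDemo {
    // Decodes raw response bytes with the given charset (the input side).
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    // Encodes text and decodes it again with the same charset (the output side).
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }
}
```

Decoding the bytes 0xC2 0xA3 as UTF-8 gives "£", while decoding them as ISO-8859-1 gives the two characters "Â£". Round-tripping "£" through US-ASCII gives "?", which matches the symptom in the question and suggests that some step is encoding with a charset that has no pound sign.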
In answer to your sub-question: you can use non-ASCII chars, such as "£", in your source code if you tell Java what character encoding your source code is in. This is the -encoding option to javac, but as you're likely to be using an IDE, you can simply set the character encoding of your file in its properties and the IDE will do the rest. The most portable thing to do is to set the character encoding in your IDE to "UTF-8". Eclipse allows you to set the character encoding for the whole project or on individual files.
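If you would rather not depend on the source file encoding at all, one alternative (not mentioned above, so treat it as a suggestion) is to write the character as a Unicode escape. A minimal sketch, with PoundMatcher as a made-up class name:

```java
// Hypothetical helper class showing the match from the question.
class PoundMatcher {
    // U+00A3 is the pound sign; the escape compiles the same under any source encoding.
    static final String POUND = "\u00A3";

    // Returns true if the text contains a pound sign.
    static boolean containsPound(String text) {
        return text.indexOf(POUND) >= 0;
    }
}
```

Either way, the match only succeeds if the page content was decoded correctly in the first place.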