android regex httpclient iso-8859-1

Android regex encoding


I'm downloading a website's source code using HttpClient, and then I want to extract some data from it using regular expressions. Unfortunately, the website is encoded in iso-8859-1, which seems to be causing problems. Here's the sample code that downloads the page:

HttpGet query = new HttpGet(url);
HttpResponse queryResponse = httpClient.execute(query);
String queryText = EntityUtils.toString(queryResponse.getEntity()).replaceAll("\r", " ").replaceAll("\n", " ");

And then the expression:

Pattern pattern = Pattern.compile("<p class=\"qt\">(.*?)</p>");
Matcher matcher = pattern.matcher(queryText);
while (matcher.find()) {
    // do something with matcher.group(1)
}

The problem is that it misses some occurrences when there are special iso-8859-1 characters; (.*?) doesn't seem to match them. What's the reason for this problem, and how do I fix it?


Solution

  • Are you sure this has to do with "special iso-8859-1 characters" and not newlines? By default, the dot (.) does not match line terminators. You can pass the DOTALL flag to make it match line terminators as well, e.g.:

    Pattern pattern = Pattern.compile("<p class=\"qt\">(.*?)</p>", Pattern.DOTALL);
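
To see the difference the flag makes, here is a minimal, self-contained sketch in plain Java (no Android or HttpClient dependency). The class name DotAllDemo, its helper methods, and the sample HTML are made up for illustration. It also shows one way the encoding and the line-terminator explanations could be related. This is an assumption about the asker's page, not something confirmed in the question: java.util.regex treats U+0085 (NEXT LINE) as a line terminator, and that is the character byte 0x85 becomes when decoded as ISO-8859-1, so replacing only \r and \n would not remove it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDemo {
    // Same expression as in the question.
    static final String REGEX = "<p class=\"qt\">(.*?)</p>";

    // Hypothetical page content with three paragraphs:
    // one with an accented Latin-1 letter, one with a newline,
    // and one ending in byte 0x85 decoded as ISO-8859-1 (U+0085).
    static String sampleHtml() {
        String nel = new String(new byte[] {(byte) 0x85}, StandardCharsets.ISO_8859_1);
        return "<p class=\"qt\">caf\u00e9</p>"
             + "<p class=\"qt\">one\ntwo</p>"
             + "<p class=\"qt\">wait" + nel + "</p>";
    }

    // Returns capture group 1 of every match, using the given Pattern flags.
    static List<String> extract(String html, int flags) {
        Pattern pattern = Pattern.compile(REGEX, flags);
        Matcher matcher = pattern.matcher(html);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String html = sampleHtml();
        // Default flags: only the "caf\u00e9" paragraph matches.
        System.out.println(extract(html, 0).size());              // prints 1
        // DOTALL: the dot also matches \n and U+0085.
        System.out.println(extract(html, Pattern.DOTALL).size()); // prints 3
    }
}
```

Note that the accented letter is matched even without the flag; only the paragraphs containing a line terminator are skipped, which is consistent with the suggestion above.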