
HtmlUnit - HTMLParser (page with characters)


I have a resource (a static HTML page) that I want to use in a test. But when I load the static page, some characters come out with the wrong encoding. I tried the StringEscapeUtils class, but it doesn't work. My function:

  private HtmlPage getStaticPage() throws IOException, ClassNotFoundException {
    final Reader reader = new InputStreamReader(
        this.getClass().getResourceAsStream("/" + "testPage" + ".html"), "UTF-8");
    final StringWebResponse response = new StringWebResponse(
        StringEscapeUtils.unescapeHtml4(IOUtils.toString(reader)),
        StandardCharsets.UTF_8, new URL(URL_PAGE));
    return HTMLParser.parseHtml(response, WebClientFactory.getInstance().getCurrentWindow());
  }

import org.apache.commons.lang3.StringEscapeUtils;


Solution

  • final Reader reader = new InputStreamReader(this.getClass().getResourceAsStream("/" + "testPage" + ".html"), "UTF-8");
    

    For the reader, use the actual encoding of the file (from your comment I guess this is windows-1252 in your case). Then read the file into a string (e.g. using commons-io).

    Then you can process it like this:

    final StringWebResponse tmpResponse = new StringWebResponse(anHtmlCode,
        new URL("http://www.wetator.org/test.html"));
    final WebClient tmpWebClient = new WebClient(aBrowserVersion);
    try {
      final HtmlPage tmpPage = HTMLParser.parseHtml(tmpResponse, tmpWebClient.getCurrentWindow());
      return tmpPage;
    } finally {
      tmpWebClient.close();
    }
    

    If you still have problems, please build a minimal sample from your page that reproduces the issue and post it here together with your code.
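
The first step above (decoding the file with its real encoding) can be checked in isolation, without HtmlUnit. The following is only a sketch using the JDK: the class name `ResourceDecodingSketch`, the helper `readAll`, and the sample bytes are made up for illustration, and the in-memory stream stands in for `getClass().getResourceAsStream("/testPage.html")`. It assumes the file really is windows-1252, as guessed above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HtmlUnit or commons-io.
final class ResourceDecodingSketch {

    // Reads the whole stream and decodes the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) {
        try {
            return new String(in.readAllBytes(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in windows-1252; as a lone byte it is not valid UTF-8.
        byte[] fileBytes = {'c', 'a', 'f', (byte) 0xE9};
        Charset windows1252 = Charset.forName("windows-1252");

        // Decoding with the file's real encoding keeps the character intact.
        String right = readAll(new ByteArrayInputStream(fileBytes), windows1252);
        // Decoding with the wrong encoding yields the replacement character U+FFFD.
        String wrong = readAll(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9"));
        System.out.println(wrong.contains("\uFFFD"));
    }
}
```

If the string decoded with the guessed charset looks right, it can be passed as `anHtmlCode` to the `StringWebResponse` shown above. If it still contains replacement characters, the file is probably in yet another encoding.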