java html utf-8 stax apache-commons-lang

Characters generated by Apache Commons StringEscapeUtils.unescapeHtml cannot be parsed using StAX

I am trying to parse the content of an HTML table and write it to CSV, using a StAX parser. The HTML contains escaped characters like &nbsp; and &amp;.

I am using org.apache.commons.lang3.StringEscapeUtils to unescape the HTML line by line and write it to a new file.

StAX still fails to parse the unescaped characters.

Please help me fix or handle this exception.

I test with below xml fragment - <root><element>A   B   </element></root>

I call below code to unescape html -

   StringEscapeUtils.unescapeHtml4(escapedHtml)

and write it to a file.

I then try to parse that file using a StAX parser. This is the method that unescapes the file and writes the result:

public void unescapeHtmlFile(String filePath) throws IOException{
    BufferedReader fileReader = null;
    BufferedWriter fileWriter = null;
    try{
    fileReader = new BufferedReader(new FileReader(filePath));
    fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));

    String line = null;
    String unescapedLine = null;
    while((line=fileReader.readLine())!=null){
        System.out.println("Before: " + line);
        unescapedLine = StringEscapeUtils.unescapeHtml4(line);
        System.out.println("After: " + unescapedLine);
        fileWriter.newLine();
        fileWriter.write(unescapedLine);
    }
    }finally{
        fileReader.close();
        fileWriter.close();
    }
}

And the output is below-

Document started 
<?xml version="null" encoding='UTF-8' standalone='no'?>
Element started
<root>
Element started
<element0>
Characters
0123456   7890   ABC   DEF
Element ended
</element0>
Element started
<element1>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
    at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
    at parser.StreamParserTest.main(StreamParserTest.java:30)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: XML document structures must start and end within the same entity.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
    at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
    at parser.StreamParserTest.main(StreamParserTest.java:30)

It fails to parse the unescaped value of &nbsp;. Please help.

Solution

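The classes FileReader and FileWriter are old utility classes that unfortunately use the platform default encoding, which on Windows is almost certainly not UTF-8 (at least before Java 18, which made UTF-8 the default). XML, on the other hand, is in UTF-8 unless declared otherwise (and UTF-8 can indeed represent all characters). unescapeHtml4 turns &nbsp; into the character U+00A0; a single-byte encoding such as windows-1252 writes it as the lone byte 0xA0, which is not valid UTF-8, hence "Invalid byte 1 of 1-byte UTF-8 sequence".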
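The mismatch can be reproduced without any files. The following is a minimal sketch, not the asker's code: the class name NbspEncodingDemo and the method parses are made up for illustration, and ISO-8859-1 stands in for a single-byte platform encoding such as windows-1252 (both encode U+00A0, the character that &nbsp; unescapes to, as the single byte 0xA0).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of the original question or answer.
class NbspEncodingDemo {

    // U+00A0 (no-break space) is what unescapeHtml4 produces for &nbsp;.
    static final String XML = "<root><element>A\u00A0B</element></root>";

    /** Encodes XML with the given charset and reports whether StAX reads it to the end. */
    static boolean parses(Charset writeCharset) {
        byte[] bytes = XML.getBytes(writeCharset);
        try {
            // No encoding is passed and the document has no XML declaration,
            // so the parser assumes UTF-8 when decoding the bytes.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(bytes));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("single-byte encoding parses: " + parses(StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 encoding parses:       " + parses(StandardCharsets.UTF_8));
    }
}
```

The same text parses or fails depending only on the charset used when writing it, which is the situation FileWriter creates on a platform whose default encoding is not UTF-8.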

fileReader = new BufferedReader(new FileReader(filePath));
fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));

should be

fileReader = new BufferedReader(new InputStreamReader(
        new FileInputStream(filePath), StandardCharsets.UTF_8));
fileWriter = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("./out/UnescapedHtml.html"),
        StandardCharsets.UTF_8));
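For reference, here is one way the whole method could look with the UTF-8 fix applied. It is only a sketch under a few assumptions: the class name Utf8LineRewriter and the lineTransform parameter are invented for this example, and the unescaping step is passed in as a function so the code compiles without Commons Lang (in the asker's setup one would presumably pass StringEscapeUtils::unescapeHtml4). It uses java.nio.file.Files, which takes the charset directly.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.UnaryOperator;

// Hypothetical helper class; the names are illustrative only.
class Utf8LineRewriter {

    /**
     * Copies input to output line by line, applying lineTransform to each line.
     * Both files are treated as UTF-8, independent of the platform default.
     */
    static void rewrite(Path input, Path output, UnaryOperator<String> lineTransform)
            throws IOException {
        // try-with-resources closes both files, even if an exception is thrown.
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(lineTransform.apply(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for StringEscapeUtils::unescapeHtml4; handles only &nbsp;.
        UnaryOperator<String> unescape = s -> s.replace("&nbsp;", "\u00A0");
        rewrite(Paths.get(args[0]), Paths.get(args[1]), unescape);
    }
}
```

Unlike the method in the question, this sketch writes the line separator after each line instead of before it, so the output does not begin with an empty line. That matters if the first line is an XML declaration, which has to be at the very start of the file.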

To be thorough, one should first read the <?xml ...?> declaration and check whether it has an encoding attribute naming the charset; the default is UTF-8. That first read could be done with StandardCharsets.ISO_8859_1, which accepts every byte, whereas a UTF-8 decoder stumbles over invalid multi-byte sequences.
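A rough sketch of that check might look as follows. The class XmlDeclarationSniffer, the method detect and the regular expression are all assumptions made for this example, not an existing API. It only covers ASCII-compatible encodings; UTF-16 input, for instance, would need byte-order-mark handling that is left out here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a simplified illustration, not a full XML encoding detector.
class XmlDeclarationSniffer {

    // Matches encoding="..." or encoding='...' inside an XML declaration.
    private static final Pattern DECLARED_ENCODING = Pattern.compile(
            "<\\?xml\\s[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the charset named in the XML declaration found in the first bytes
     * of a file, or UTF-8 if no encoding is declared.
     */
    static Charset detect(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one character, so this never fails.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED_ENCODING.matcher(prolog);
        // Charset.forName throws an unchecked exception for unknown charset names.
        return m.find() ? Charset.forName(m.group(1)) : StandardCharsets.UTF_8;
    }
}
```

The returned charset could then be used for the reader that feeds the unescaping step. When the file is only going to be parsed, handing the StAX factory an InputStream rather than a Reader lets the parser do this detection itself.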

Using StandardCharsets instead of the String "UTF-8" does away with both an UnsupportedEncodingException to handle and a magic constant.

The StandardCharsets are guaranteed to be supported.
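The difference shows up in even the smallest call. In this sketch the class and method names are invented for illustration; String.getBytes is used only because it has both a charset-name and a Charset overload.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the two ways of naming a charset.
class CharsetConstantDemo {

    /** Charset given by name: a checked exception must be handled, and a typo compiles fine. */
    static byte[] encodeWithName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for UTF-8, but the compiler still demands the handler.
            throw new IllegalStateException(e);
        }
    }

    /** Charset given by constant: nothing to catch, and the name is checked at compile time. */
    static byte[] encodeWithConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```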