I am trying to parse the contents of an HTML table and write them to a CSV file.
I am using a StAX parser.
The HTML contains escaped characters such as &nbsp;
and &amp;.
I am using org.apache.commons.lang3.StringEscapeUtils
to unescape the HTML line by line and write it to a new file.
StAX still fails to parse the unescaped characters.
Please help me fix or handle this exception.
I test with the XML fragment below:
<root><element>A B </element></root>
I call the code below to unescape the HTML:
StringEscapeUtils.unescapeHtml4(escapedHtml)
and write the result to a file.
I then try to parse that file with a StAX parser. The method that unescapes the file is:
public void unescapeHtmlFile(String filePath) throws IOException{
    BufferedReader fileReader = null;
    BufferedWriter fileWriter = null;
    try{
        fileReader = new BufferedReader(new FileReader(filePath));
        fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));
        String line = null;
        String unescapedLine = null;
        while((line=fileReader.readLine())!=null){
            System.out.println("Before: " + line);
            unescapedLine = StringEscapeUtils.unescapeHtml4(line);
            System.out.println("After: " + unescapedLine);
            fileWriter.newLine();
            fileWriter.write(unescapedLine);
        }
    }finally{
        fileReader.close();
        fileWriter.close();
    }
}
The output is below:
Document started
<?xml version="null" encoding='UTF-8' standalone='no'?>
Element started
<root>
Element started
<element0>
Characters
0123456 7890 ABC DEF
Element ended
</element0>
Element started
<element1>
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
at parser.StreamParserTest.main(StreamParserTest.java:30)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,66]
Message: XML document structures must start and end within the same entity.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:598)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
at parser.StreamParserTest.main(StreamParserTest.java:30)
It fails to parse the unescaped value of
Please help.
The classes FileReader and FileWriter are old utility classes that unfortunately use the current platform encoding, which on Windows is almost certainly not UTF-8. XML, on the other hand, is generally in UTF-8 (which can indeed represent all characters), so the unescaped file ends up written in an encoding the parser does not expect.
fileReader = new BufferedReader(new FileReader(filePath));
fileWriter = new BufferedWriter(new FileWriter("./out/UnescapedHtml.html"));
should be
fileReader = new BufferedReader(new InputStreamReader(
        new FileInputStream(filePath), StandardCharsets.UTF_8));
fileWriter = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("./out/UnescapedHtml.html"),
        StandardCharsets.UTF_8));
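To see why the encoding matters for this particular error, here is a small sketch of what happens to the non-breaking space (U+00A0, which is what &nbsp; becomes after unescaping). It assumes ISO-8859-1 as a stand-in for a single-byte Windows default encoding; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class NbspEncodingDemo {

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // the character &nbsp; unescapes to

        // Assumption: ISO-8859-1 stands in for the platform default encoding.
        byte[] singleByte = nbsp.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = nbsp.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1: " + hex(singleByte)
                + ", valid UTF-8: " + isValidUtf8(singleByte));
        System.out.println("UTF-8:      " + hex(utf8)
                + ", valid UTF-8: " + isValidUtf8(utf8));
    }
}
```

In the single-byte encoding the character is written as the lone byte A0, which is not a legal UTF-8 sequence on its own; in UTF-8 it is the two bytes C2 A0. A parser that reads the file as UTF-8 and meets the lone A0 would plausibly report exactly the "Invalid byte 1 of 1-byte UTF-8 sequence" error shown in the question.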
Strictly speaking, one should first read the <?xml ...?> declaration
and check whether it has an encoding
attribute naming the charset; the default is UTF-8. That first read could be done with StandardCharsets.ISO_8859_1
, since ISO-8859-1 accepts every byte, whereas decoding as UTF-8 stumbles over invalid multi-byte sequences.
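A rough sketch of that check follows. It reads the first bytes of the file as ISO-8859-1 and looks for an encoding attribute in the XML declaration with a regular expression. The class name, the 200-byte limit, and the pattern are my own assumptions, and the sketch ignores byte order marks and UTF-16 input, which a real XML parser handles.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlEncodingSniffer {

    // Matches encoding="..." or encoding='...' inside a leading <?xml ...?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns the declared charset, or UTF-8 if none is declared or it is unsupported.
    static Charset declaredCharset(byte[] head) {
        // ISO-8859-1 maps every byte to a character, so this decoding cannot fail.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8;
    }

    // Reads up to 200 bytes from the start of the file and inspects them.
    static Charset declaredCharset(Path file) throws IOException {
        byte[] head = new byte[200];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break; // end of file
                n += r;
            }
        }
        return declaredCharset(Arrays.copyOf(head, n));
    }
}
```

The returned Charset could then be passed to InputStreamReader in place of the hard-coded StandardCharsets.UTF_8.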
Using StandardCharsets instead of the String "UTF-8" does away with having to handle UnsupportedEncodingException.
The charsets in StandardCharsets are guaranteed to be supported.
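For completeness, here is one way the whole copy loop might look with explicit UTF-8 and try-with-resources, so both files are closed even if an exception occurs. This is only a sketch: the class and method names are hypothetical, and the per-line transformation is passed in as a function so the example runs without Commons Lang. In the question's setting one would presumably pass StringEscapeUtils::unescapeHtml4.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

class Utf8LineRewriter {

    // Copies 'in' to 'out' line by line, applying 'transform' to each line.
    // Both files are read and written as UTF-8, independent of the platform default.
    static void rewrite(Path in, Path out, UnaryOperator<String> transform)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform.apply(line));
                writer.newLine(); // separator after each line, so the first line stays first
            }
        }
    }
}
```

Note that Files.newBufferedReader throws an exception on bytes that are not valid in the given charset, rather than silently replacing them, so an input file that is not really UTF-8 shows up early instead of at parse time.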