I am wondering why my SAX parser seems unable to resolve certain entities defined in an external DTD file. I am processing a huge XML file; a heavily reduced version of the input is:
// myxml.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE authors SYSTEM "mydtd.dtd">
<authors>
  <author>
Bal&aacute;zs
  </author>
</authors>
And this is the incorrect output:
Bal
?zs
Obviously, &aacute; was not resolved!
This is how I have set up the parser:
// MySaxParser.java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParser extends DefaultHandler {
    // Name of the element currently being parsed (its bookkeeping is omitted here).
    private String currentTag;

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if ("author".equals(currentTag)) {
            // Print only the slice of the buffer that belongs to this event.
            System.out.println(new String(ch, start, length));
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
        spf.setNamespaceAware(true);
        spf.setValidating(true); // From what I understood from the API, this combined
                                 // with '<!DOCTYPE authors SYSTEM "mydtd.dtd">' from
                                 // myxml.xml should do the trick.
        SAXParser saxParser = spf.newSAXParser();
        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setContentHandler(new MySaxParser());
        xmlReader.setErrorHandler(new MyErrorHandler(System.err));
        xmlReader.parse("file:/path/to/myxml.xml");
    }
}
What am I missing? Do I have to do more than spf.setValidating(true) to make the parser aware of the DTD referenced in the DOCTYPE declaration of the XML file?
I should mention that the DTD and the XML are syntactically and semantically correct. The DTD contains <!ENTITY aacute "á" ><!-- small a, acute accent --> as the mapping for this entity. I downloaded the files from a trusted source, so the error has to be in my code.
Update:
As @eckes suggested, I printed the int values of the characters as they are passed into the characters method, via
@Override
public void characters(char[] ch, int start, int length)
        throws SAXException {
    if ("author".equals(currentTag)) {
        // The slice for this event ends at start + length, not at length.
        for (int i = start; i < start + length; i++) {
            System.out.println(ch[i] + " - " + Character.getNumericValue(ch[i]));
        }
    }
}
The console output was:
B - 11
a - 10
l - 21
? - -1
z - 35
s - 28
The -1 indicates that something went wrong before the characters event was even fired, doesn't it?
My ErrorHandler:
package com.hw;

import java.io.PrintStream;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class MyErrorHandler implements ErrorHandler {
    private PrintStream out;

    MyErrorHandler(PrintStream out) {
        this.out = out;
    }

    private String getParseExceptionInfo(SAXParseException spe) {
        String systemId = spe.getSystemId();
        if (systemId == null) {
            systemId = "null";
        }
        String info = "URI=" + systemId + " Line=" + spe.getLineNumber() + ": "
                + spe.getMessage();
        return info;
    }

    public void warning(SAXParseException spe) throws SAXException {
        out.println("Warning: " + getParseExceptionInfo(spe));
    }

    public void error(SAXParseException spe) throws SAXException {
        String message = "Error: " + getParseExceptionInfo(spe);
        throw new SAXException(message);
    }

    public void fatalError(SAXParseException spe) throws SAXException {
        String message = "Fatal Error: " + getParseExceptionInfo(spe);
        throw new SAXException(message);
    }
}
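You almost certainly have a problem with the output encoding: the console, or whatever is receiving your output, is not using an encoding that can represent the character. Java strings are UTF-16 internally, but System.out encodes them with the platform default charset, and a character that charset cannot represent is typically written as ?.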
You are also being tricked by the Character#getNumericValue() method into thinking that you have an input or parser encoding problem. getNumericValue() tries to interpret the character as something representing a number; it does not return the code point or anything like it. As the documentation states, if you give it the Roman numeral fifty, Ⅼ (U+216C), the method returns 50. That is also why you got 11 for B and 10 for a: Latin letters are treated as digits of higher radixes (10 through 35), and á has no numeric value at all, hence -1.
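To make the difference concrete, here is a small sketch that prints both values side by side. The class name NumericValueDemo and the describe helper are made up for illustration; only Character.getNumericValue and Integer.toHexString are real JDK calls.

```java
// Sketch: compare Character.getNumericValue with the actual UTF-16 code unit.
class NumericValueDemo {
    // Hypothetical helper: shows the "numeric value" next to the hex code unit.
    static String describe(char c) {
        return "numeric=" + Character.getNumericValue(c)
                + " hex=" + Integer.toHexString(c);
    }

    public static void main(String[] args) {
        // 'B', 'a', a-acute (U+00E1) and the Roman numeral fifty (U+216C).
        for (char c : new char[] {'B', 'a', '\u00e1', '\u216c'}) {
            System.out.println(describe(c));
        }
    }
}
```

For U+00E1 this reports a numeric value of -1 but a hex value of e1, so the -1 says nothing about whether the parser delivered the right character.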
Try replacing the line:
System.out.println(ch[i] + " - " + Character.getNumericValue(ch[i]));
with
System.out.println(ch[i] + " - " + Integer.toHexString((int) ch[i]));
and you'll probably see that it prints
? - e1
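If you want to check the parser in isolation, without the console in the way, something like the following sketch should do. It is not your setup: I am assuming a reduced document with the entity declared in an internal DTD subset (so the example does not need mydtd.dtd), and the class name EntityDemo and the parseText helper are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityDemo {
    // Assumed stand-in for myxml.xml: the entity is declared inline
    // instead of in the external DTD, to keep the sketch self-contained.
    static final String XML =
            "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE authors [<!ENTITY aacute \"&#225;\">]>"
            + "<authors><author>Bal&aacute;zs</author></authors>";

    // Parses XML and returns all character data joined into one string.
    static String parseText() throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Append only the slice that belongs to this event.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(XML)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Print hex code units only, so the console encoding cannot interfere.
        for (char c : parseText().toCharArray()) {
            System.out.println(Integer.toHexString(c));
        }
    }
}
```

Because only ASCII hex digits are printed, an e1 in the output means the entity was resolved correctly and the ? comes from the output side.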
Now, how to fix the output encoding problem: I cannot help you there unless you give us more details.
Update
You can set the Eclipse console encoding in Run Configurations --> Common, or in the JDK/JRE using the -Dfile.encoding property (not 100% sure on this one).
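Another option, independent of the IDE settings, is to wrap System.out in a PrintStream with an explicit charset. This is only a sketch: I am assuming your console expects UTF-8, and EncodedOutputDemo and its encode helper are names I made up to show what bytes each charset produces.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class EncodedOutputDemo {
    // Hypothetical helper: returns the bytes a PrintStream writes for the
    // given text when it is configured with the named charset.
    static byte[] encode(String text, String charsetName)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, charsetName);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the receiving console decodes its input as UTF-8.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("Bal\u00e1zs");
    }
}
```

With a charset that cannot represent á, such as US-ASCII, the helper yields a single ? byte, which matches the symptom in your output.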