
XStream entity abbreviation parsing


I'm currently trying to parse the Japanese JMdict XML document, which declares a bunch of ENTITY references that are used throughout the document.
Like this bit here:

<!ENTITY MA "martial arts term">
<!ENTITY X "rude or X-rated term (not displayed in educational software)">
<!ENTITY abbr "abbreviation">
<!ENTITY adj-i "adjective (keiyoushi)">
<!ENTITY adj-ix "adjective (keiyoushi) - yoi/ii class">

These are then referenced in the XML like so: <field>&MA;</field>

XStream does not like this and demands that I fix this and then promptly throws a ConversionException and quits.

Is there a way to automatically recognize these entities and swap them out?
I'd prefer not having to write 170 lines of xml = xml.replace(one, other);

I'm just using XPP3 and then annotations to create POJOs from the data to begin with. No custom parser.


Solution

  • Since you say you're using XPP3, I assume that you are creating your XStream object like this:

    XStream xstream = new XStream();  //uses XPP3
    

    The problem is that XPP3 apparently does not resolve entities out of the box:

    ...it is user responsibility to resolve entity reference.

    So unless you want to implement entity resolution, you need to use a parser that does resolve entities. If you want to stay with a pull parser, you can use StAX like this:

    XStream xstream = new XStream(new StaxDriver());
    

    Alternatively you could use DOM (not a pull parser; loads the entire document into memory):

    XStream xstream = new XStream(new DomDriver());
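
To check that a StAX parser really expands these entities before wiring it into XStream, a small standalone test with the JDK's own `javax.xml.stream` API can help. The sketch below is independent of XStream and rests on assumptions: the class name `EntityDemo`, the `<entry>` root element, and the one-entity sample document are made up for illustration, with the entity declared in an internal DTD subset as in the question. The two factory properties are set explicitly rather than relying on the implementation's defaults.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class EntityDemo {
    // Hypothetical miniature of the JMdict layout: an internal DTD subset
    // declaring one entity, plus one element that references it.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE entry [\n"
        + "<!ENTITY MA \"martial arts term\">\n"
        + "]>\n"
        + "<entry><field>&MA;</field></entry>";

    // Returns the text content of every <field> element, with entity
    // references expanded by the parser.
    static String readFieldText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to read the DTD and replace entity references.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.TRUE);
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        boolean inField = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "field".equals(reader.getLocalName())) {
                    inField = true;
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "field".equals(reader.getLocalName())) {
                    inField = false;
                } else if (inField && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                    // Text may arrive in several chunks, so append each one.
                    text.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(readFieldText(SAMPLE));
    }
}
```

If the expanded description is printed instead of an error about an unresolved entity, the parser is doing the substitution for you. Note that a file with as many entity references as JMdict may run into the JDK parser's limit on entity expansions, which would need to be raised separately.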