
Parsing with SAX and handling character entities


I am parsing a MathML expression with SAX (although the fact that it's MathML may not be completely relevant). An example input string is

<math xmlns='http://www.w3.org/1998/Math/MathML'>
     <mrow>
          <mo>&lambda;</mo>
     </mrow>
</math>

In order for the SAX parser to accept this string, I expand it a bit:

<?xml version="1.0"?>
     <!DOCTYPE doc_type [
          <!ENTITY nbsp "&#160;">
          <!ENTITY amp "&#38;">
]>
<body>
     <math xmlns='http://www.w3.org/1998/Math/MathML'>
          <mrow>
               <mo>&lambda;</mo>
          </mrow>
     </math>
</body>

Now, when I run the SAX parser on this, I get an exception:

[Fatal Error] :5:86: The entity "lambda" was referenced, but not declared.
org.xml.sax.SAXParseException: The entity "lambda" was referenced, but not 
                               declared.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

However, I know how to fix that. I simply add this line to the string being parsed:

        <!ENTITY lambda "&#955;">

This gives me

<?xml version="1.0"?>
     <!DOCTYPE doc_type [
          <!ENTITY nbsp "&#160;">
          <!ENTITY amp "&#38;">
          <!ENTITY lambda "&#955;">
]>
<body>
     <math xmlns='http://www.w3.org/1998/Math/MathML'>
          <mrow>
               <mo>&lambda;</mo>
          </mrow>
     </math>
</body>

Now it parses without errors.

However, the problem is that I can't add an ENTITY declaration for every possible character entity that might be used in MathML (for example, "part", "notin", and "sum").

How do I rewrite this string so that it can be parsed for any possible character entity that might be included?


Solution

  • Use a DOCTYPE declaration that refers to the MathML DTD, which declares the MathML character entities (lambda, part, notin, sum, and so on):

    <!DOCTYPE math 
        PUBLIC "-//W3C//DTD MathML 3.0//EN"
               "http://www.w3.org/Math/DTD/mathml3/mathml3.dtd">
    

    or a local copy of the same, so that the parser does not have to fetch the DTD over the network on every parse.
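
The local-copy variant can be wired up in Java with a SAX EntityResolver that intercepts the MathML public identifier. What follows is only a minimal sketch under stated assumptions: the class name MathMLEntityDemo, the method parseText, and the two-entity LOCAL_DTD_STAND_IN string are made up for illustration. In real use the resolver would return an InputSource for the actual mathml3.dtd file (and the entity files it pulls in) on disk.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class MathMLEntityDemo {
    // Public identifier used in the DOCTYPE declaration shown above.
    static final String MATHML_PUBLIC_ID = "-//W3C//DTD MathML 3.0//EN";

    // ASSUMPTION: a tiny stand-in for a local copy of the MathML DTD.
    // It declares only two entities so that the sketch runs offline.
    static final String LOCAL_DTD_STAND_IN =
            "<!ENTITY lambda \"&#955;\">\n"
          + "<!ENTITY sum \"&#8721;\">\n";

    // Parses the document and returns its character data, trimmed.
    static String parseText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Collect all character data reported by the parser,
        // including the expanded entity references.
        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        // Serve the DTD locally when the MathML public ID is requested;
        // returning null lets the parser fall back to its default lookup.
        reader.setEntityResolver((publicId, systemId) -> {
            if (MATHML_PUBLIC_ID.equals(publicId)) {
                return new InputSource(new StringReader(LOCAL_DTD_STAND_IN));
            }
            return null;
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String doc =
              "<!DOCTYPE math PUBLIC \"-//W3C//DTD MathML 3.0//EN\"\n"
            + "    \"http://www.w3.org/Math/DTD/mathml3/mathml3.dtd\">\n"
            + "<math xmlns='http://www.w3.org/1998/Math/MathML'>\n"
            + "  <mrow><mo>&lambda;</mo></mrow>\n"
            + "</math>";
        String text = parseText(doc);
        // Print the code point rather than the glyph to avoid
        // console encoding issues.
        System.out.printf("U+%04X%n", (int) text.charAt(0));
    }
}
```

With the stand-in DTD, &lambda; should come back from the characters() callback as the single character U+03BB. No ENTITY declarations are needed in the document itself, because the parser picks them up from whatever the resolver returns for the external subset.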