c# .net xml sgml

Are parameter entity references in SGML/XML parsable using .NET?


When I try to parse the data below with XDocument, I get the following error:

"XMLException: A parameter entity reference is not allowed in internal markup"

Here is an example of the data that I am trying to parse:

<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY % std       "standard SGML">
  <!ENTITY % signature " &#x2014; &author;.">
  <!ENTITY % question  "Why couldn&#x2019;t I publish my books directly in %std;?">
  <!ENTITY % author    "William Shakespeare">
]>
<sgml>&question;&signature;</sgml>

Here is the code that is trying to parse the file above:

string caFile = @"pathToFile";
using (var caStream = File.Open(caFile, FileMode.Open, FileAccess.Read))
{
    var caDoc = XDocument.Load(caStream); // Exception thrown here!
}

Is there a way to get the built-in .NET XML parsing libraries to handle entity references, or at the very least to ignore the embedded DOCTYPE declaration and parse the root element?

NOTE: I am working under the assumption that parameter entity references are valid inside XML. (see here)


Solution

  • There are a few issues here, but the main one is that you should be using general entities instead of parameter entities:

    1. You are defining your entities to be Parameter Entities. These are basically macros that are for use only inside the DTD itself. From the XML Specification:

      Parameter-entity references MUST NOT appear outside the DTD.

      And from XML in a Nutshell 2nd Edition:

      It would be preferable to define a constant that can hold the common parts of the content specification for all five kinds of listings and refer to that constant from inside the content specification of each element. ...

      An entity reference is the obvious candidate here. However, general entity references are not allowed to provide replacement text for a content specification or attribute list, only for parts of the DTD that will be included in the XML document itself. Instead, XML provides a new construct exclusively for use inside DTDs, the parameter entity, which is referred to by a parameter entity reference. Parameter entities behave like and are declared almost exactly like a general entity. However, they use a % instead of an &, and they can only be used in a DTD while general entities can only be used in the document content.

      Your XML, however, is referring to the entities in its document content. This suggests you should be using general entities rather than parameter entities.

    2. One of your parameter entities, question, embeds a reference to another parameter entity, %std;, in its replacement text. Inside the internal DTD subset, this is explicitly disallowed by the XML Specification:

      In the internal DTD subset, parameter-entity references MUST NOT occur within markup declarations; they may occur where markup declarations can occur. (This does not apply to references that occur in external parameter entities or to the external subset.)

      Again, it appears you should be using general entities, not parameter entities, since the former can be used "inside the DTD in places where they will eventually be included in the body of an XML document, for instance ... in the replacement text of another entity."

    3. You need to enable DTD processing by setting XmlReaderSettings.ProhibitDtd = false (.NET 3.5) or XmlReaderSettings.DtdProcessing = DtdProcessing.Parse (later versions).
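    The three points above can be checked with a minimal sketch. It assumes .NET 5 or later (C# top-level statements); the helper TryLoad and the shortened sample documents are hypothetical, introduced only for illustration. It loads a small document under different settings and reports whether the parser accepts it.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: tries to load the XML with the given settings.
// Returns true and the root element's text on success, false if the parser rejects it.
static bool TryLoad(string xml, XmlReaderSettings settings, out string text)
{
    try
    {
        using (var sr = new StringReader(xml))
        using (var reader = XmlReader.Create(sr, settings))
        {
            text = XDocument.Load(reader).Root?.Value ?? "";
            return true;
        }
    }
    catch (XmlException ex)
    {
        Console.WriteLine("XmlException: " + ex.Message);
        text = "";
        return false;
    }
}

// "std" declared as a parameter entity (%) but referenced from document content.
string withParameterEntity = @"<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY % std ""standard SGML"">
]>
<sgml>&std;</sgml>";

// The same declaration as a general entity.
string withGeneralEntity = @"<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY std ""standard SGML"">
]>
<sgml>&std;</sgml>";

var parse = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };
var defaults = new XmlReaderSettings(); // DtdProcessing is Prohibit unless changed

// Points 1 and 2: a parameter entity is not visible to &std; in the content.
Console.WriteLine(TryLoad(withParameterEntity, parse, out _));
// General entity with DTD processing enabled: parsed and expanded.
Console.WriteLine(TryLoad(withGeneralEntity, parse, out var expanded) + " " + expanded);
// Point 3: with default reader settings the DOCTYPE itself is rejected.
Console.WriteLine(TryLoad(withGeneralEntity, defaults, out _));
```

    Only the second call should succeed; the other two should each report an XmlException.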

    Putting this together, the following code:

        string xmlGood = @"<!DOCTYPE sgml [
      <!ELEMENT sgml ANY>
      <!ENTITY std       ""standard SGML"">
      <!ENTITY signature "" &#x2014; &author;."">
      <!ENTITY question  ""Why couldn&#x2019;t I publish my books directly in &std;?"">
      <!ENTITY author    ""William Shakespeare"">
    ]>
    <sgml>&question;&signature;</sgml>";
    
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };
    
        using (var sr = new StringReader(xmlGood))
        using (var xmlReader = XmlReader.Create(sr, settings))
        {
            var doc = XDocument.Load(xmlReader);
            Console.WriteLine(doc);
        }               
    

    Produces the following output:

    <!DOCTYPE sgml [
      <!ELEMENT sgml ANY>
      <!ENTITY std       "standard SGML">
      <!ENTITY signature " — &author;.">
      <!ENTITY question  "Why couldn’t I publish my books directly in &std;?">
      <!ENTITY author    "William Shakespeare">
    ]>
    <sgml>Why couldn’t I publish my books directly in standard SGML? — William Shakespeare.</sgml>
    

    As you can see, the general entities are parsed and expanded.
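    For contrast, the specification passage quoted in point 2 does allow parameter-entity references in the internal subset "where markup declarations can occur". The following is a rough sketch of that legitimate use, assuming .NET 5 or later (C# top-level statements); the entity name decls and the sample text are hypothetical. The parameter entity holds a complete declaration and is referenced between declarations rather than inside one.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// %decls; expands to a whole markup declaration, which in turn
// declares the general entity "std" used in the document content.
string xmlWithParameterEntity = @"<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY % decls ""<!ENTITY std 'standard SGML'>"">
  %decls;
]>
<sgml>Published in &std;.</sgml>";

var peSettings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };

string expandedText;
using (var sr = new StringReader(xmlWithParameterEntity))
using (var xmlReader = XmlReader.Create(sr, peSettings))
{
    // The reader expands &std; while loading, so Value holds the final text.
    expandedText = XDocument.Load(xmlReader).Root?.Value ?? "";
}

Console.WriteLine(expandedText);
```

    The content still refers to a general entity (&std;); the parameter entity only supplies its declaration inside the DTD.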