
How do you keep .NET XML parsers from expanding general entities in XML?


When I parse the XML below (using the code further down), I keep getting <sgml>&question;&signature;</sgml>

expanded to

<sgml>Why couldn’t I publish my books directly in standard SGML? — William Shakespeare.</sgml>

OR

<sgml></sgml>

Since I am working on an XML three-way merge algorithm, I would like to retrieve the unexpanded <sgml>&question;&signature;</sgml>

I have tried:

  • Parsing the XML normally (this results in the expanded sgml tag)
  • Removing the DOCTYPE from the beginning of the XML (this results in an empty sgml tag)
  • Various XmlReader DTD settings

I have the following XML file:

<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY  std       "standard SGML">
  <!ENTITY  signature " &#x2014; &author;.">
  <!ENTITY  question  "Why couldn&#x2019;t I publish my books directly in &std;?">
  <!ENTITY  author    "William Shakespeare">
]>
<sgml>&question;&signature;</sgml>

Here is the code I have tried (several attempts):

using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Reflection;

class Program
{
    static void Main(string[] args)
    {
        string xml = @"C:\src\Apps\Wit\MergingAlgorithmTest\MergingAlgorithmTest\Tests\XMLMerge-DocTypeExpansion\DocTypeExpansion.0.xml";
        var xmlSettingsIgnore = new XmlReaderSettings 
            {
                CheckCharacters = false,
                DtdProcessing = DtdProcessing.Ignore
            };

        var xmlSettingsParse = new XmlReaderSettings
        {
            CheckCharacters = false,
            DtdProcessing = DtdProcessing.Parse
        };

        using (var fs = File.Open(xml, FileMode.Open, FileAccess.Read))
        {
            using (var xmlReaderIgnore = XmlReader.Create(fs, xmlSettingsIgnore))
            {
                // Prevents Exception "Reference to undeclared entity 'question'"
                PropertyInfo propertyInfo = xmlReaderIgnore.GetType().GetProperty("DisableUndeclaredEntityCheck", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
                propertyInfo.SetValue(xmlReaderIgnore, true, null);

                var doc = XDocument.Load(xmlReaderIgnore);

                Console.WriteLine(doc.Root.ToString()); // outputs <sgml></sgml> not <sgml>&question;&signature;</sgml>
            }// using xml ignore

            fs.Position = 0;
            using (var xmlReaderParse = XmlReader.Create(fs, xmlSettingsParse))
            {
                var doc = XDocument.Load(xmlReaderParse);
                Console.WriteLine(doc.Root.ToString()); // outputs <sgml>Why couldn't I publish my books directly in standard SGML? - William Shakespeare.</sgml> not <sgml>&question;&signature;</sgml>
            }

            fs.Position = 0;
            string parseXmlString = String.Empty;
            using (StreamReader sr = new StreamReader(fs))
            {
                for (int i = 0; i < 7; ++i) // Skip DocType
                    sr.ReadLine();

                parseXmlString = sr.ReadLine();
            }

            using (XmlReader xmlReaderSkip = XmlReader.Create(new StringReader(parseXmlString),xmlSettingsParse))
            {
                // Prevents Exception "Reference to undeclared entity 'question'"
                PropertyInfo propertyInfo = xmlReaderSkip.GetType().GetProperty("DisableUndeclaredEntityCheck", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
                propertyInfo.SetValue(xmlReaderSkip, true, null);

                var doc2 = XDocument.Load(xmlReaderSkip); // Empty sgml tag

            }
        }//using FileStream
    }
}

Solution

  • LINQ to XML does not support modeling of entity references; they are automatically expanded to their values (source 1, source 2). There simply is no subclass of XObject defined for a general entity reference.

    However, assuming your XML is valid (i.e. the entity references exist in the DTD, which they do in your example) you can use the old XML Document Object Model to parse your XML and insert XmlEntityReference nodes into your XML DOM tree, rather than expanding the entity references into plain text:

            // Requires: using System.Diagnostics; using System.IO; using System.Xml;
            // "xml" is the file path from the question.
            using (var sr = new StreamReader(xml))
            using (var xtr = new XmlTextReader(sr))
            {
                // Expand character entities, but report general entities as XmlNodeType.EntityReference
                xtr.EntityHandling = EntityHandling.ExpandCharEntities;
                var oldDoc = new XmlDocument();
                oldDoc.Load(xtr);
                Debug.WriteLine(oldDoc.DocumentElement.OuterXml); // Outputs <sgml>&question;&signature;</sgml>
                // Verify that the entity references are still there (neither assertion fires)
                Debug.Assert(oldDoc.DocumentElement.OuterXml.Contains("&question;"));
                Debug.Assert(oldDoc.DocumentElement.OuterXml.Contains("&signature;"));
            }
    

    The ChildNodes of each XmlEntityReference hold the replacement text of the general entity. If a general entity refers to other general entities, as in your case, the corresponding inner XmlEntityReference nodes are nested in the ChildNodes of the outer one. You can then compare the old and new XML using the XmlDocument API.
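    To make the nesting concrete, here is a minimal sketch that loads the XML from the question out of a string and collects the names of all entity references depth-first, including the nested ones. The class and method names (EntityReferenceWalker, LoadPreservingEntities, CollectEntityNames) are made up for illustration, and setting DtdProcessing explicitly is a precaution on my part rather than something the code above requires.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
static class EntityReferenceWalker
{
    // The XML from the question, embedded so the sketch needs no file on disk.
    public const string SampleXml = @"<!DOCTYPE sgml [
  <!ELEMENT sgml ANY>
  <!ENTITY  std       ""standard SGML"">
  <!ENTITY  signature "" &#x2014; &author;."">
  <!ENTITY  question  ""Why couldn&#x2019;t I publish my books directly in &std;?"">
  <!ENTITY  author    ""William Shakespeare"">
]>
<sgml>&question;&signature;</sgml>";

    // Loads XML text while keeping general entity references as XmlEntityReference nodes.
    public static XmlDocument LoadPreservingEntities(string xmlText)
    {
        using (var sr = new StringReader(xmlText))
        using (var xtr = new XmlTextReader(sr))
        {
            xtr.DtdProcessing = DtdProcessing.Parse; // assumption: set explicitly so the internal DTD is read
            xtr.EntityHandling = EntityHandling.ExpandCharEntities;
            var doc = new XmlDocument();
            doc.Load(xtr);
            return doc;
        }
    }

    // Collects entity reference names below a node, depth-first,
    // descending into the children of each XmlEntityReference as well.
    public static List<string> CollectEntityNames(XmlNode node)
    {
        var names = new List<string>();
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.EntityReference)
                names.Add(child.Name);
            names.AddRange(CollectEntityNames(child));
        }
        return names;
    }

    public static void Main()
    {
        XmlDocument doc = LoadPreservingEntities(SampleXml);
        Console.WriteLine(doc.DocumentElement.OuterXml);
        foreach (string name in CollectEntityNames(doc.DocumentElement))
            Console.WriteLine(name);
    }
}
```

    For a merge algorithm, a list like this could serve as a cheap way to check which entity references each version of a document uses before comparing the trees node by node.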

    Note that this only works with the older XmlTextReader, and you need to set EntityHandling = EntityHandling.ExpandCharEntities on it.
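    The effect of that setting can be seen by loading the same input twice with the two EntityHandling values passed explicitly. This is only a sketch: the one-entity document is a trimmed-down stand-in for the XML in the question, and the names EntityHandlingComparison and LoadRootOuterXml are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
static class EntityHandlingComparison
{
    // Trimmed-down stand-in for the document in the question (assumption).
    public const string SampleXml =
        "<!DOCTYPE sgml [<!ENTITY author \"William Shakespeare\">]><sgml>&author;</sgml>";

    // Loads the XML into an XmlDocument with the given EntityHandling
    // and returns the serialized root element.
    public static string LoadRootOuterXml(string xmlText, EntityHandling handling)
    {
        using (var sr = new StringReader(xmlText))
        using (var xtr = new XmlTextReader(sr))
        {
            xtr.DtdProcessing = DtdProcessing.Parse; // assumption: set explicitly so the internal DTD is read
            xtr.EntityHandling = handling;
            var doc = new XmlDocument();
            doc.Load(xtr);
            return doc.DocumentElement.OuterXml;
        }
    }

    public static void Main()
    {
        var modes = new[] { EntityHandling.ExpandCharEntities, EntityHandling.ExpandEntities };
        foreach (EntityHandling handling in modes)
            Console.WriteLine(handling + ": " + LoadRootOuterXml(SampleXml, handling));
    }
}
```

    With ExpandCharEntities the reference survives as a node in the DOM; with ExpandEntities the reader hands XmlDocument plain text, so the information a merge would need is already gone by the time the tree is built.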