
Parse XHTML document with undefined entity


In Python, when I need to load an XHTML document that uses an undefined entity, I create a parser and add the entity (e.g. nbsp) to its entity dict:

import xml.etree.ElementTree as ET
parser = ET.XMLParser()
parser.entity['nbsp'] = '\u00a0'  # non-breaking space
tree = ET.parse(opener.open(url), parser=parser)
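
For reference, here is a self-contained sketch of the same idea that parses from a string rather than a URL. The sample document and the name XHTML_SAMPLE are made up for illustration. As far as I can tell, the entity table is only consulted when the document carries a DOCTYPE that points at an external DTD, so the sample includes one:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The DOCTYPE references an external DTD,
# which the parser does not download.
XHTML_SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>a&nbsp;b</p></body></html>'
)

parser = ET.XMLParser()
parser.entity['nbsp'] = '\u00a0'  # map the undeclared entity to U+00A0
root = ET.fromstring(XHTML_SAMPLE, parser=parser)

# XHTML elements live in the XHTML namespace
p = root.find('.//{http://www.w3.org/1999/xhtml}p')
print(repr(p.text))
```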

In VB.NET I tried to load the XHTML document as a LINQ to XML XDocument:

Dim x As XDocument = XDocument.Load(url)

which raised an XmlException:

Reference to undeclared entity 'nbsp'

Googling around, I couldn't find any example of how to update the entity table, or any other simple way to parse an XHTML document with an undefined entity.

How can I solve this apparently simple problem?


Solution

  • Entity resolution is done by the underlying parser, which here is a standard XmlReader (or XmlTextReader).

    Officially, you're supposed to declare entities in DTDs (see Oleg's answer here: Problem with XHTML entities), or load DTDs dynamically into your documents. There are some examples here on SO like this: How do I resolve entities when loading into an XDocument?

    What you can also do is create a hacky class derived from XmlTextReader that returns Text nodes when entities are detected, looking their values up in a dictionary, as in the following sample code:

    using (XmlTextReaderWithEntities reader = new XmlTextReaderWithEntities(MyXmlFile))
    {
        reader.AddEntity("nbsp", "\u00A0");
        XDocument xdoc = XDocument.Load(reader);
    }
    
    ...
    
    public class XmlTextReaderWithEntities : XmlTextReader
    {
        private string _nextEntity;
        private Dictionary<string, string> _entities = new Dictionary<string, string>();
    
        // NOTE: add the other constructors for completeness
        public XmlTextReaderWithEntities(string path)
            : base(path)
        {
        }
    
        public void AddEntity(string entity, string value)
        {
            _entities[entity] = value;
        }
    
        public override bool Read()
        {
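            // a resolved entity is pending: stay on the current node,
            // which NodeType/Value will now report as a Text node
            if (_nextEntity != null)
                return true;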
    
            return base.Read();
        }
    
        public override XmlNodeType NodeType
        {
            get
            {
                if (_nextEntity != null)
                    return XmlNodeType.Text;
    
                return base.NodeType;
            }
        }
    
        public override string Value
        {
            get
            {
                if (_nextEntity != null)
                {
                    // hand out the pending value once, then clear it so
                    // the next Read() resumes normal parsing
                    string value = _nextEntity;
                    _nextEntity = null;
                    return value;
                }
                return base.Value;
            }
        }
    
        public override void ResolveEntity()
        {
            // if not found, return the string as is
            if (!_entities.TryGetValue(LocalName, out _nextEntity))
            {
                _nextEntity = "&" + LocalName + ";";
            }
            // NOTE: we don't use base here. Depends on the scenario
        }
    }
    

    This approach works in simple scenarios, but you may need to override other members for completeness.

    PS: Sorry it's in C#; you'll have to adapt it to VB.NET :)
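
To illustrate the "declare entities in DTDs" route mentioned in the answer, here is a minimal sketch in Python (chosen only because the question already uses it). It prepends a DOCTYPE whose internal subset declares the entity, so the stock parser can expand it without a custom entity table. The helper name with_entity_declarations and the sample markup are hypothetical, and the sketch assumes the input has no XML declaration or DOCTYPE of its own:

```python
import xml.etree.ElementTree as ET

def with_entity_declarations(markup, entities, root_name="html"):
    """Prepend a DOCTYPE whose internal subset declares each entity.

    `entities` maps an entity name to a Unicode code point.
    Assumes `markup` has no XML declaration or DOCTYPE already.
    """
    decls = "".join(
        '<!ENTITY {} "&#{};">'.format(name, code_point)
        for name, code_point in entities.items()
    )
    return "<!DOCTYPE {} [{}]>{}".format(root_name, decls, markup)

doc = with_entity_declarations(
    "<html><body><p>a&nbsp;b</p></body></html>", {"nbsp": 0xA0}
)
root = ET.fromstring(doc)  # default parser, no entity table needed
text = root.find("./body/p").text
print(repr(text))
```

The trade-off compared with the reader subclass is that every entity has to be declared up front, and the DOCTYPE has to be spliced into the document text before parsing.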