
How to check if another node exists in an xml without reading - C#


I would like to implement code that deserializes an XML document into a list of objects. The code below has a problem: the condition in the while loop reads forward, so every other node is skipped. What is the proper way to check for a next node in the while loop without advancing the reader past it?

    private Task<List<TAxEntity>> Deserialize(XmlReader reader)
    {
        var deserializer = new XmlSerializer(typeof(TAxEntity));
        var entities = new List<TAxEntity>();

        do
        {
            using (var stringReader = new StringReader(reader.ReadOuterXml()))
            {
                var entity = (TAxEntity)deserializer.Deserialize(stringReader);

                entities.Add(entity);
            }
        }
        while (reader.ReadToNextSibling(EntityElementName));

        return Task.FromResult(entities);
    }

Solution

  • To check that an XmlReader is already correctly positioned, you can check whether reader.NodeType == XmlNodeType.Element and reader.Name == EntityElementName. Then, if the reader is already correctly positioned, do not scan forward using ReadToNextSibling().

    However, there are a few improvements to be made to your algorithm:

    1. Instead of checking reader.Name, check whether LocalName and NamespaceURI are as expected, and if not, call reader.ReadToNextSibling(string localName, string namespaceURI). reader.Name is the qualified name and includes the namespace prefix, which the author of the XML can choose freely, so comparing against it hardcodes a prefix and can fail on equivalent XML that uses a different one.

    2. Rather than ReadOuterXml(), call reader.ReadSubtree() and pass the returned reader directly to deserializer.Deserialize(). Your current algorithm parses the XML, reformats it into a second XML string, then parses that string a second time. Using ReadSubtree() allows the XmlSerializer to stream a nested element directly from the incoming XmlReader and so avoids this extra parsing and reformatting.
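
To illustrate point 1, here is a minimal sketch that reads two equivalent documents differing only in their namespace prefix. The namespace "urn:example", the element name Entity, and the helper IsOnElement are made up for this illustration. reader.Name differs between the two documents, while LocalName and NamespaceURI are the same for both:

```csharp
using System;
using System.IO;
using System.Xml;

public static class PrefixDemo
{
    // Hypothetical helper: true when the reader sits on an element
    // with the given local name and namespace, whatever its prefix.
    public static bool IsOnElement(XmlReader reader, string localName, string namespaceUri)
    {
        return reader.NodeType == XmlNodeType.Element
            && reader.LocalName == localName
            && reader.NamespaceURI == namespaceUri;
    }

    public static void Main()
    {
        // Same element, same namespace, different (arbitrary) prefixes.
        var documents = new[]
        {
            "<a:Entity xmlns:a=\"urn:example\" />",
            "<b:Entity xmlns:b=\"urn:example\" />",
        };

        foreach (var xml in documents)
        {
            using (var reader = XmlReader.Create(new StringReader(xml)))
            {
                reader.MoveToContent(); // Skip to the root element.
                Console.WriteLine($"Name={reader.Name}, LocalName={reader.LocalName}, NamespaceURI={reader.NamespaceURI}");
                Console.WriteLine(IsOnElement(reader, "Entity", "urn:example"));
            }
        }
    }
}
```

A comparison such as reader.Name == "a:Entity" would only match the first document, whereas the LocalName/NamespaceURI check matches both.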

    Putting all this together, you can introduce the following lower-level extension method:

    public static class XmlReaderExtensions
    {
        public static IEnumerable<TElement> DeserializeSequence<TElement>(this XmlReader reader, string localEntityElementName, string namespaceURI)
        {
            if (reader == null)
                throw new ArgumentNullException(nameof(reader));
            var deserializer = new XmlSerializer(typeof(TElement));
            while ((reader.NodeType == XmlNodeType.Element && reader.LocalName == localEntityElementName && reader.NamespaceURI == namespaceURI)
                || reader.ReadToNextSibling(localEntityElementName, namespaceURI))
            {
                // Using ReadSubtree() instead of ReadOuterXml() avoids having to parse, reformat, then parse the formatted XML a second time:
                // the element is read directly from the current stream only once.
                TElement element;
                using (var subReader = reader.ReadSubtree())
                {
                    element = (TElement)deserializer.Deserialize(subReader);
                }
                // Consume the EndElement also (or move past the current element if reader.IsEmptyElement).
                reader.Read();
                yield return element;
            }
        }
    }
    

    And modify your Deserialize() method to be as follows:

        private Task<List<TAxEntity>> Deserialize(XmlReader reader)
        {
            var entities = reader.DeserializeSequence<TAxEntity>(EntityElementName, "" /* Pass the correct namespace here */).ToList();
    
            return Task.FromResult(entities);
        }       
    


    Note that any manual XmlReader code should be unit-tested with both indented and unindented XML, since bugs that involve skipping nodes are sometimes masked when parsing indented XML (because a whitespace node gets skipped in place of an element).
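
As a rough sketch of such a test, the following program runs two loops over the same four items, once indented and once unindented. The Root/Item document, the class name, and both method names are invented for this illustration, and ReadOuterXml() is used only to keep the demo short. It also assumes the reader in the question was positioned on the first entity element before the loop, which the question does not show. ReadWithQuestionLoop has the same shape as the loop in the question; ReadWithPositionCheck tests the current position before scanning forward:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SkippingDemo
{
    public const string Indented =
        "<Root>\n  <Item>1</Item>\n  <Item>2</Item>\n  <Item>3</Item>\n  <Item>4</Item>\n</Root>";
    public const string Unindented =
        "<Root><Item>1</Item><Item>2</Item><Item>3</Item><Item>4</Item></Root>";

    // Same shape as the question: ReadOuterXml() followed by ReadToNextSibling().
    public static List<string> ReadWithQuestionLoop(string xml)
    {
        var results = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!reader.ReadToFollowing("Item"))
                return results;
            do
            {
                // ReadOuterXml() leaves the reader on the node AFTER the element.
                results.Add(reader.ReadOuterXml());
            }
            while (reader.ReadToNextSibling("Item")); // Skips whatever node the reader is on now.
        }
        return results;
    }

    // Checks whether the reader is already on an Item before scanning forward.
    public static List<string> ReadWithPositionCheck(string xml)
    {
        var results = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent(); // On <Root>.
            reader.Read();          // Move to its first child node.
            while ((reader.NodeType == XmlNodeType.Element && reader.LocalName == "Item" && reader.NamespaceURI == "")
                || reader.ReadToNextSibling("Item", ""))
            {
                results.Add(reader.ReadOuterXml());
            }
        }
        return results;
    }

    public static void Main()
    {
        foreach (var xml in new[] { Indented, Unindented })
        {
            Console.WriteLine(
                $"question loop: {ReadWithQuestionLoop(xml).Count}, position check: {ReadWithPositionCheck(xml).Count}");
        }
    }
}
```

With the indented input, the node following each element is whitespace, so ReadToNextSibling() skips that and lands on the next Item, and the question-style loop appears to work. With the unindented input, the node following each element is the next Item itself, which is then skipped, so only every other item is returned.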