c# .net xml validation serialization

How can I extract values as strings from an xml file based on the element/property name in a generated .Net class or the original XSD?


I have a large complex XSD set.

I have C# classes generated from those XSDs using xsd.exe. Naturally, though the majority of properties in the generated classes are strings, many are decimals, DateTimes, enums or bools, just as they should be.

Now, I have some UNVALIDATED data that is structured in the correct XML format, but may well NOT be able to pass XSD validation, let alone be put into an instance of the relevant .Net object. For example, at this stage, for all we know the value for the element that should be a DateTime could be "ABC" - not even parseable as a DateTime - let alone other string elements respecting maxLength or regex pattern restrictions. This data is ready to be passed in to a rules engine that we already have to make everything valid, including defaulting things appropriately depending on other data items, etc.

I know how to use the various types in System.Xml to read the string value of a given element by name. Clearly I could just hand craft code to get out all the elements that exist today by name, but if the XSD changes, the code would need to be reworked. I'd like to be able to either directly read the XSD or use reflection on the generated classes (including attributes like [System.Xml.Serialization.XmlTypeAttribute(TypeName=...)] where necessary) to find exactly how to recursively query the XML down to the raw, unverified string version of any given element to pass through to the ruleset. Then, after the rules have made something valid of it, I'd either put it back into the strongly typed object or back into a copy of the XML for deserialization into the object.

(It has occurred to me that an alternative approach would be to somehow automatically generate a 'stringly typed' version of the object, where there are no DateTimes etc., nothing but strings, and deserialize the XML into that. I have even madly thought of taking the xsd.exe generated .cs file and search/replacing all the enums and base types that aren't strings with strings, but there has to be a better way.)

In other words, is there an existing generic way to pull the XElement or attribute value from some XML that would correspond to a given item in a .Net class if it were serialized, without actually deserializing the XML into that class?


Solution

  • Sorry to self-answer, and sorry for the lack of actual code in my answer, but I don't yet have the permission of my employer to share the actual code on this. Working on it, I'll update here when there is movement.

    I was able to implement something I called a Tolerant XML Reader. Unlike most XML deserializing, it starts by using reflection to look at the structure of the required .Net type, and then attempts to find the relevant XElements and interpret them. Any extra elements are ignored (because they are never looked for), any elements not found are defaulted, and any elements found are further interpreted.

    The main method signature, in C#, is as follows:

    public static T TolerantDeserializeIntoType<T>(
                XDocument doc,
                out List<string> messagesList,
                out bool isFromSuppliedData,
                XmlSchemaSet schemas = null,
                bool tolerant = true)
    

    A typical call to it might look like this:

    List<string> messagesList;
    bool defaultOnly;
    SomeType result = TolerantDeserializeIntoType<SomeType>(someXDocument, out messagesList, out defaultOnly);
    

    (you may use var; I just explicitly put the type there for clarity in this example).

    This will take any XDocument (so the only requirement on the original is that it is well-formed), and make an instance of the specified type (SomeType, in this example) from it.

    Note that even if nothing at all in the XML is recognized, it will still not fail. The new instance will simply have all properties / public fields nulled or defaulted, messagesList would list all the defaulting done, and the boolean out parameter 'defaultOnly' would be true.

    The recursive method that does all the work has a similar signature, except it takes an XElement instead of an XDocument, and it does not take a schemaSet. (The present implementation also has an explicit bool to indicate a recursive call defaulting to false. This is a slightly dirty way to allow it to gather all failure messages up to the end before throwing an error if tolerant is false; in a future version I will refactor that to only expose publicly a version without that, if I even want to make the XElement version public at all):

    public static T TolerantDeserializeXElementIntoType<T>(
                ref XElement element,
                ref List<string> messagesList,
                out bool isFromSuppliedValue,
                bool tolerant = true,
                bool recursiveCall = false)
    

    How it works, detail

    Starting with the main call, the one with an XDocument and optional XmlSchemaSet:

    If a schema set that will compile is supplied (it also looks for xsi:noNamespaceSchemaLocation), the XDocument version first runs a standard XDocument.Validate() across the supplied XDocument, but only to collect any validation errors reported through the callback. It won't throw an exception, and it is done for only two reasons:

    1. it will give some useful messages for the messagesList, and

    2. it will populate the SchemaInfo of all XElements to possibly use later in the XElement version.

      (note, however, that the schema is entirely optional. It is actually only used to resolve some ambiguous situations where it can be unclear from the C# object if a given XElement is mandatory or not.)
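
    The non-throwing validation pass described above can be sketched roughly as follows. This is not the actual code from my reader, only a minimal illustration; the class and method names (ValidationSketch, CollectValidationMessages) are made up for this example. It relies on the standard Validate extension method for XDocument in System.Xml.Schema, which reports problems to the handler instead of throwing when a handler is supplied.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.Schema;

public static class ValidationSketch
{
    // Hypothetical helper: validates the document without throwing and
    // returns every validation message that was reported.
    public static List<string> CollectValidationMessages(XDocument doc, XmlSchemaSet schemas)
    {
        var messages = new List<string>();

        // Because a handler is supplied, errors and warnings go to the handler
        // rather than being thrown as exceptions.
        // addSchemaInfo: true annotates the tree so that GetSchemaInfo()
        // can be called on elements afterwards.
        doc.Validate(
            schemas,
            (sender, e) => messages.Add(e.Severity + ": " + e.Message),
            addSchemaInfo: true);

        return messages;
    }
}
```

    The returned list can simply be appended to the overall message list before the recursive pass starts.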

    From there, the recursive XElement version is called on the root node and the supplied C# type.

    I've made the code look for the style of C# objects generated by xsd.exe, though most basic structured objects using Properties and Fields would probably work even without the CustomAttributes that xsd.exe supplies, if the Xml elements are named the same as the properties and fields of the object.
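
    As a rough illustration of that name matching (again, not the real implementation), the lookup for a single property might look like the sketch below. The Order class and the FindElementFor helper are hypothetical. It uses the name from [XmlElement] when one is present and falls back to the property name, and as a simplification it compares local names only, ignoring namespaces.

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Xml.Linq;
using System.Xml.Serialization;

// Hypothetical class in the style that xsd.exe generates.
public class Order
{
    [XmlElement("order-date")]
    public DateTime OrderDate { get; set; }

    public string Customer { get; set; }
}

public static class ElementLookupSketch
{
    // Finds the child element that would map to the given property,
    // or returns null if the XML does not contain it.
    public static XElement FindElementFor(XElement parent, PropertyInfo property)
    {
        // Default to the property name; override it if [XmlElement] names the element.
        string localName = property.Name;
        var attr = property.GetCustomAttribute<XmlElementAttribute>();
        if (attr != null && !string.IsNullOrEmpty(attr.ElementName))
        {
            localName = attr.ElementName;
        }

        // Simplification: match on local name only, ignoring the namespace.
        return parent.Elements().FirstOrDefault(e => e.Name.LocalName == localName);
    }
}
```

    The Value of the returned XElement is the raw, unparsed string, which is what a rules engine would want to see before any conversion is attempted.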

    The method looks for:

    • Arrays
    • Simple value types, explicitly:
      • String
      • Enum
      • Bool
      • then anything else by using the relevant TryParse() method, found by reflection.
      • (note that nulls/xsi:nil='true' values also have to be specially handled)
    • objects, recursively.
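
    The "TryParse() found by reflection" step can be sketched like this. It is a minimal example under the assumption that the target type exposes a public static TryParse(string, out T); the TryParseSketch class and TryParseByReflection method are names invented for this illustration. Enums are not covered by it, because Enum.TryParse is generic and has to be handled separately, as in the list above.

```csharp
using System;
using System.Reflection;

public static class TryParseSketch
{
    // Returns true and the parsed value if targetType has a public static
    // TryParse(string, out T) and the text parses; otherwise returns false.
    public static bool TryParseByReflection(Type targetType, string text, out object value)
    {
        value = null;

        // Look for exactly TryParse(string, out T); the out parameter is a by-ref type.
        MethodInfo tryParse = targetType.GetMethod(
            "TryParse",
            BindingFlags.Public | BindingFlags.Static,
            null,
            new[] { typeof(string), targetType.MakeByRefType() },
            null);

        if (tryParse == null)
        {
            return false; // the type offers no suitable TryParse
        }

        // The second slot receives the out value after the call.
        object[] args = { text, null };
        bool parsed = (bool)tryParse.Invoke(null, args);
        if (parsed)
        {
            value = args[1];
        }
        return parsed;
    }
}
```

    A failed parse is then a natural place to add a message and fall back to a default, rather than throwing. Note that the two-argument TryParse overloads use the current culture, so a real implementation would probably want the overloads that accept an IFormatProvider.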

    It also looks for a boolean 'xxxSpecified' in the object for each field or property 'xxx' that it finds, and sets it appropriately. This is how xsd.exe indicates an element being omitted from the XML in cases where null won't suffice.
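
    Setting that companion flag might be sketched as below. The Shipment class and the SetSpecifiedFlag helper are hypothetical, and the sketch assumes the flag is a public bool property or field named exactly memberName + "Specified".

```csharp
using System;
using System.Reflection;
using System.Xml.Serialization;

// Hypothetical class in the style that xsd.exe generates for an optional value type.
public class Shipment
{
    public decimal Weight { get; set; }

    [XmlIgnore]
    public bool WeightSpecified { get; set; }
}

public static class SpecifiedSketch
{
    // Sets the 'xxxSpecified' companion of member 'xxx', if the target has one.
    // Does nothing when no such bool property or field exists.
    public static void SetSpecifiedFlag(object target, string memberName, bool wasSupplied)
    {
        Type type = target.GetType();
        string flagName = memberName + "Specified";

        // Try a writable bool property first.
        PropertyInfo property = type.GetProperty(flagName);
        if (property != null && property.PropertyType == typeof(bool) && property.CanWrite)
        {
            property.SetValue(target, wasSupplied);
            return;
        }

        // Otherwise fall back to a public bool field.
        FieldInfo field = type.GetField(flagName);
        if (field != null && field.FieldType == typeof(bool))
        {
            field.SetValue(target, wasSupplied);
        }
    }
}
```

    For example, after finding and parsing a Weight element, calling SpecifiedSketch.SetSpecifiedFlag(shipment, "Weight", true) marks the value as having come from the XML.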

    That's the outline. As I said, I may be able to put actual code somewhere like GitHub in due course. Hope this is helpful to someone.