c# xml xdoc

Remove all text not wrapped in XML braces


I want to remove all invalid text from an XML document. I consider any text not wrapped in <> XML brackets to be invalid, and want to strip these prior to translation.

The post "Regular expression to remove text outside the tags in a string" explains how to match XML brackets together. However, on my example it doesn't remove the text outside of the XML tags, as can be seen here: https://regex101.com/r/6iUyia/1

I don't think this specific example has been asked on S/O before, based on my initial research.

Currently my code holds this XML as a string, and I compose an XDocument from it later on, so string, Regex and XDocument methods are all available to help remove the text. There could be more than one piece of invalid text in these documents. I do not wish to use XSLT to remove these values.

One very rudimentary idea I tried, and failed to get working, was to iterate over the string as a char array and remove any character that fell outside of '>' and '<'. I decided there must be a better way to achieve this (hence the question).

This is an example of the input, with the invalid text sitting between nested-A and nested-B:

 <ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
   <A>
         <nested-A>valid text</nested-A>
         Remove text not inside valid xml braces
         <nested-B>more valid text here</nested-B>
   </A>
</ASchema>

I expect the output to look like this:

 <ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
   <A>
         <nested-A>valid text</nested-A>
         <nested-B>more valid text here</nested-B>
   </A>
</ASchema>

Solution

  • You could do the following. Please note that I have done very limited testing, so let me know if it fails in some scenarios.

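    // Requires: using System.Xml; using Newtonsoft.Json; using Newtonsoft.Json.Linq;
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(str); // str holds the XML text
    // Round-trip through JSON so that stray text nodes show up as "#text" properties
    string json = JsonConvert.SerializeXmlNode(doc);

    string result = JToken.Parse(json).RemoveFields().ToString(Newtonsoft.Json.Formatting.None);
    // DeserializeXmlNode already returns an XmlDocument, so no cast is needed
    XmlDocument xml = JsonConvert.DeserializeXmlNode(result);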
    

    where RemoveFields is defined as

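    // Requires: using System.Collections.Generic; using Newtonsoft.Json.Linq;
    public static class Extensions
    {
        // Recursively removes every JSON property whose name starts with '#'.
        // Json.NET uses such names (e.g. "#text", "#comment") for text and comment nodes.
        public static JToken RemoveFields(this JToken token)
        {
            JContainer container = token as JContainer;
            if (container == null) return token;

            // Collect first, remove afterwards: the children cannot be modified while iterating
            List<JToken> removeList = new List<JToken>();
            foreach (JToken el in container.Children())
            {
                JProperty p = el as JProperty;
                if (p != null && p.Name.StartsWith("#"))
                {
                    removeList.Add(el);
                }
                el.RemoveFields();
            }

            foreach (JToken el in removeList)
                el.Remove();

            return token;
        }
    }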
    

    Output

    <ASchema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xdt="http://www.w3.org/2005/xpath-datatypes" xmlns:fn="http://www.w3.org/2005/xpath-functions">
       <A>
          <nested-A>valid text</nested-A>
          <nested-B>more valid text here</nested-B>
       </A>
    </ASchema>
    

    Please note that I am using Json.NET (Newtonsoft.Json) in the above code.
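
    Since the question mentions that an XDocument is built from the string anyway, the same clean-up can probably be done directly with LINQ to XML, without the JSON round trip. The following is only a minimal sketch under one assumption: "invalid" text means a text node whose parent element also contains child elements. The names MixedContentCleaner and RemoveStrayText are made up for illustration. One reason to consider it: Json.NET also writes the text of an element that carries attributes as a "#text" property, so the filter above would, as far as I can tell, drop that text as well.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class MixedContentCleaner
{
    // Parses the XML and removes every text node that sits next to child
    // elements (assumption: such mixed-content text is the "invalid" text).
    public static XDocument RemoveStrayText(string xml)
    {
        // Without LoadOptions.PreserveWhitespace, whitespace-only text
        // between elements is not kept as nodes, so only real text remains.
        XDocument doc = XDocument.Parse(xml);

        // Materialize the list first: nodes cannot be removed while the
        // lazy DescendantNodes() query is still being enumerated.
        var stray = doc.DescendantNodes()
            .OfType<XText>() // note: XCData derives from XText, so CDATA is included
            .Where(t => t.Parent != null && t.Parent.Elements().Any())
            .ToList();

        foreach (XText text in stray)
        {
            text.Remove();
        }

        return doc;
    }
}
```

    Calling MixedContentCleaner.RemoveStrayText(str) returns the cleaned XDocument, and ToString() on it gives the XML text back. Text inside leaf elements such as nested-A is kept, because those elements have no child elements.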