
Conditionally escape special xml characters


I have looked around a lot but have not been able to find a built-in .NET method that escapes the special XML characters (<, >, &, ' and ") only when they are not part of a tag.

For example, take the following text:

Test& <b>bold</b> <i>italic</i> <<Tag index="0" />

I want it to be converted to:

Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />

Notice that the tags are not escaped. I need to assign this value to the InnerXml property of an XmlElement, so those tags must be preserved.

I have looked into writing my own parser, using a StringBuilder to keep it as efficient as I can, but it gets pretty nasty.

I also know which tags are acceptable, which may simplify things (only: br, b, i, u, blink, flash, Tag). These tags can be either self-closing (e.g. <u />) or container tags (e.g. <u>...</u>).

Solution

  • NOTE: This could probably be optimised; it is just something I knocked up quickly. I am not validating the tags themselves: the code only looks for content wrapped in angle brackets, so any such run (e.g. <foo>) is left unescaped. It will also fail if an angle bracket appears inside a tag (e.g. <sometag label="I put an > here">). Entities that are already present (e.g. &amp;) are left untouched. Other than that, I think it should do what you're asking for.

    namespace ConsoleApplication1
    {
        using System;
        using System.Text.RegularExpressions;
    
        class Program
        {
            static void Main(string[] args)
            {
                // This is the test string.
                const string testString = "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />";
    
                // Do a regular expression search and replace. The pattern tries three alternatives in order:
                //   Tag     - a complete tag, which is left untouched;
                //   Ampy    - an existing entity such as &amp;, which is also left untouched;
                //   Special - a single character that needs escaping.
                string result = Regex.Replace(testString, @"(?'Tag'<[^<>]*>)|(?'Ampy'&[A-Za-z0-9]+;)|(?'Special'[<>""'&])", (match) =>
                    {
                        // If a special (escapable) character was found, replace it.
                        if (match.Groups["Special"].Success)
                        {
                            switch (match.Groups["Special"].Value)
                            {
                                case "<":
                                    return "&lt;";
                                case ">":
                                    return "&gt;";
                                case "\"":
                                    return "&quot;";
                                case "\'":
                                    return "&apos;";
                                case "&":
                                    return "&amp;";
                                default:
                                    return match.Groups["Special"].Value;
                            }
                        }
    
                        // Otherwise, just return what was found.
                        return match.Value;
                    });
    
                // Show the result.
                Console.WriteLine("Test String: " + testString);
                Console.WriteLine("Result     : " + result);
                Console.ReadKey();
            }
        }
    }
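
The note above mentions two gaps: tag names are not checked, and a > inside an attribute value breaks the match. Since the question lists the acceptable tags, a tighter variant might look like the sketch below. It is only a sketch under a few assumptions: the class name XmlTextEscaper and the Escape method are made up for illustration, the whitelist is copied from the question, tag names are matched case-sensitively, and only the five predefined XML entities plus numeric references are treated as "already escaped". It does not validate attribute syntax, so it is not a substitute for a real parser.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper; the name and shape are illustrative only.
static class XmlTextEscaper
{
    private static readonly Regex Pattern = new Regex(
        // Tag: opening, closing or self-closing tag whose name is on the whitelist
        // from the question. Quoted attribute values may contain '<' or '>'.
        @"(?<Tag></?(?:br|b|i|u|blink|flash|Tag)(?=[\s/>])(?:[^<>""']|""[^""]*""|'[^']*')*>)"
        // Entity: one of the five predefined XML entities or a numeric reference.
        + @"|(?<Entity>&(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)"
        // Special: any remaining character that needs escaping.
        + @"|(?<Special>[<>""'&])",
        RegexOptions.Compiled);

    public static string Escape(string input)
    {
        return Pattern.Replace(input, match =>
        {
            // Whitelisted tags and existing entities are returned unchanged.
            if (!match.Groups["Special"].Success)
                return match.Value;

            switch (match.Value)
            {
                case "<": return "&lt;";
                case ">": return "&gt;";
                case "\"": return "&quot;";
                case "'": return "&apos;";
                default: return "&amp;"; // the only remaining possibility is '&'
            }
        });
    }
}

class Demo
{
    static void Main()
    {
        string escaped = XmlTextEscaper.Escape(
            "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />");
        Console.WriteLine(escaped);

        // Assign the escaped text to InnerXml, as described in the question.
        // The element name "Text" is arbitrary.
        var doc = new XmlDocument();
        XmlElement element = doc.CreateElement("Text");
        element.InnerXml = escaped;
        Console.WriteLine(element.OuterXml);
    }
}
```

With this pattern, something like <x> is escaped because x is not on the whitelist, while <Tag label="a > b"> is kept intact. Unbalanced whitelisted tags (e.g. a lone <b>) still pass through unescaped, so assigning the result to InnerXml can still throw an XmlException for such input.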