.NET fiddle/Visual Studio: Different results for regex replace on invalid XML character


I'm trying to filter invalid characters from an XML file, and have the following test project:

using System;
using System.Text.RegularExpressions;

class Program
{
    private static readonly Regex _invalidXMLChars = new Regex(@"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]", RegexOptions.Compiled);

    static void Main(string[] args)
    {
        var text = "assd&#xF;abv";

        Console.WriteLine(_invalidXMLChars.IsMatch(text));
    }
}

This test project outputs the expected result (True) in .NET Fiddle.

But when I run the same code in my Visual Studio project, the invalid character is not found and the program outputs False.

Why does this work in .NET Fiddle, but not in my project?

Altering the source XML file is not an option.


Solution

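  • Visual Studio is right. None of the characters &, #, x, F or ; is matched by your regex. In HTML and XML, however, &#xF; is a numeric character reference for U+000F, which C# writes as \u000f. That character falls inside the \x0E-\x1F range of your pattern, so it matches. .NET Fiddle presumably decoded the reference into the actual control character before the code ran, which would explain the True result there.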

    Using \u000f in Visual Studio gives a match:

    using System;
    using System.Text.RegularExpressions;
    
    public class Program
    {
        private static Regex _invalidXMLChars = new Regex(@"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]", RegexOptions.Compiled);
    
        public static void Main()
        {
            // \u000f is the actual control character U+000F, not the five-character text "&#xF;"
            var text = "assd\u000fabv";
            Console.WriteLine(_invalidXMLChars.IsMatch(text));
        }
    }
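If the XML really does contain the reference as text (the characters &, #, x, F, ;) and the file cannot be changed, one option is to check numeric character references in addition to raw characters. The following is only a sketch under that assumption: the class name XmlSanitizer, the methods ContainsInvalid and RemoveInvalid, and the NumericCharRef pattern are made up for illustration and are not part of the question's project. It reuses the question's pattern to decide what counts as invalid and does not treat CDATA sections specially.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlSanitizer
{
    // Same pattern as in the question: lone surrogates plus selected control characters.
    private static readonly Regex InvalidXmlChars = new Regex(
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
        RegexOptions.Compiled);

    // Hypothetical helper pattern: &#123; (decimal) or &#x7B; (hexadecimal) references.
    private static readonly Regex NumericCharRef = new Regex(
        @"&#(?:x(?<hex>[0-9A-Fa-f]+)|(?<dec>[0-9]+));",
        RegexOptions.Compiled);

    // True if the text holds an invalid raw character or a reference to one.
    public static bool ContainsInvalid(string text)
    {
        if (InvalidXmlChars.IsMatch(text)) return true;
        foreach (Match m in NumericCharRef.Matches(text))
        {
            if (IsInvalidReference(m)) return true;
        }
        return false;
    }

    // Removes references to invalid characters first, then invalid raw characters.
    public static string RemoveInvalid(string text)
    {
        string withoutBadRefs = NumericCharRef.Replace(
            text, m => IsInvalidReference(m) ? string.Empty : m.Value);
        return InvalidXmlChars.Replace(withoutBadRefs, string.Empty);
    }

    private static bool IsInvalidReference(Match m)
    {
        bool isHex = m.Groups["hex"].Success;
        string digits = (isHex ? m.Groups["hex"].Value : m.Groups["dec"].Value).TrimStart('0');
        if (digits.Length > 7) return true; // far beyond U+10FFFF, and keeps the parse within int range
        int codePoint = digits.Length == 0 ? 0 : Convert.ToInt32(digits, isHex ? 16 : 10);
        if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) return true;
        // Decode the reference and apply the same rule as for raw characters.
        return InvalidXmlChars.IsMatch(char.ConvertFromUtf32(codePoint));
    }
}

public class Program
{
    public static void Main()
    {
        var text = "assd&#xF;abv"; // the reference as literal text, as in the question
        Console.WriteLine(XmlSanitizer.ContainsInvalid(text)); // True
        Console.WriteLine(XmlSanitizer.RemoveInvalid(text));   // assdabv
    }
}
```

References to characters the pattern accepts, such as &#x41; or &#10;, are left untouched, so only the offending references are removed before the text is handed to an XML parser.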