c# regex unicode hex

Regular expression to match the string representation of the characters \u0000-\u007F in a string and replace it with an empty string in C#?


My string contains the hexadecimal representation of Unicode characters, and I want to replace it with an empty string. More specifically, I am trying to match every value in the range \u0000-\u007F in a string with a regex and replace it with an empty string in C#.

Example 1:

InputString: "\u007FTestString"

ExpectedResult: TestString

Example 2:

InputString: "\u007FTestString\U0000"

ExpectedResult: TestString

My current solution,

            if (!string.IsNullOrWhiteSpace(testString))
            {
                return Regex.Replace(testString, @"[^\u0000-\u007F]", string.Empty);
            }

does not match the hexadecimal representation of these characters. How do I get it to match \u0000-\u007F in the string?

Any help is appreciated. Thank you!


Solution

  • You can use

    var result = Regex.Replace(@"\u007FTestString\U0000", @"\\[uU]00[0-7][0-9A-Fa-f]", "");
    

    The @"..." verbatim string literal syntax makes every backslash a literal character, so none of them starts a string escape sequence.

    Pattern details:

    • \\ - a backslash
    • [uU] - u or U
    • 00 - two zeros
    • [0-7] - a digit from zero to seven
    • [0-9A-Fa-f] - an ASCII hex digit char.
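
To see why the original pattern found nothing, it may help to compare the two kinds of input side by side. Inside a regex, `[^\u0000-\u007F]` is a negated class over the actual characters U+0000 to U+007F, so it matches non-ASCII characters, not the six-character text `\u007F`. The following is a minimal sketch, not a definitive implementation: the class name `EscapeStripper` and the helper `StripAsciiEscapes` are hypothetical names introduced for illustration, and the pattern is the one from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeStripper
{
    // Matches the literal text \u0000 through \u007F (lowercase or uppercase U).
    private static readonly Regex AsciiEscape =
        new Regex(@"\\[uU]00[0-7][0-9A-Fa-f]");

    // Hypothetical helper: removes the escaped text and leaves everything else untouched.
    public static string StripAsciiEscapes(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
        {
            return input;
        }
        return AsciiEscape.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // Verbatim literal: the string really contains a backslash, 'u', and four hex digits.
        string escapedText = @"\u007FTestString\U0000";
        Console.WriteLine(StripAsciiEscapes(escapedText)); // TestString

        // Regular literal: the compiler turns \u007F into the single DEL character,
        // so there is no backslash in the string for the pattern to find.
        string realChar = "\u007FTestString";
        Console.WriteLine(StripAsciiEscapes(realChar).Length); // 11
    }
}
```

If the input can also hold the real control characters rather than their escaped text, a second replace with a character class such as `[\u0000-\u007F]` would be needed; whether that is wanted depends on where the string comes from.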