c# .net regex escaping

Handle escape sequences in C#


I have a C# endpoint that takes rawText as a string input. The input is sent after converting a file to a string with the third-party Aspose library, and it has the following format, e.g.:

{rawText = "\u0007\u0007\r\r\r\r\r\u0007Random Name\rRandom Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com"}

I know strings are UTF-16 encoded in C#, so when it reaches the endpoint it is converted to:

requestobj.RawText = "\a\a\r\r\r\r\r\aRandom Name\r10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com"

Is my reasoning correct that this is due to C# strings being UTF-16 encoded? And what is the best way to remove the \a\a\r\r\r\r\r\a at the beginning of the string? I am passing this text to another third-party API, which does not return the correct result with this extra text prepended.

I have tried the code below, but I want a more generic solution that handles all possibilities of \n, \r, \a, etc.

var newText = Regex.Replace(inputValue, @"\\a", "");
inputValue = inputValue.Replace(@"\a", "").Replace(@"\r", "");

Solution

  • Those are escape sequences, not UTF-16 encoding. Encoding refers to how characters are converted to bytes. Escape sequences are a way to write characters that are hard to type or invisible in source code, and debuggers use them to display such characters. Nothing got converted in the question's case: the same BELL character (0x07) can be written as either \a or \u0007, and the debugger chose the shorter form.
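
    To make this concrete, here is a small sketch (not from the original question; the variable names are only for illustration) showing that the two spellings are the same character, and why the attempts in the question found nothing to replace: both @"\a" in Replace and @"\\a" in Regex.Replace look for a literal backslash followed by the letter a.

    ```csharp
    using System;

    // "\a" and "\u0007" are two spellings of the same single character (BELL, U+0007).
    string shortForm = "\a";
    string unicodeForm = "\u0007";
    Console.WriteLine(shortForm == unicodeForm);   // True
    Console.WriteLine(shortForm.Length);           // 1

    // A verbatim string keeps the backslash: @"\a" is a backslash followed by 'a'.
    string verbatim = @"\a";
    Console.WriteLine(verbatim.Length);            // 2

    // So Replace(@"\a", "") searches for backslash + 'a' and leaves the BELL untouched.
    string sample = "\a\aRandom Name";
    string unchanged = sample.Replace(@"\a", "");
    Console.WriteLine(unchanged == sample);        // True

    // With a regular (non-verbatim) literal, the BELL characters are found and removed.
    string stripped = sample.Replace("\a", "");
    Console.WriteLine(stripped);                   // Random Name
    ```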

    To remove just these three characters at the start, you can use the regular expression @"^[\r\n\a]+". To avoid doubling the backslashes in the pattern, use a verbatim string, which doesn't treat \ as an escape character; the regex engine then interprets \r, \n and \a itself.

    using System.Text.RegularExpressions;

    var input = "\a\a\r\r\r\r\r\aRandom Name\r10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com";
    // ^ anchors the match at the start of the string
    var pattern = @"^[\r\n\a]+";
    var newText = Regex.Replace(input, pattern, "");
    

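    This produces the following. The \r after "Random Name" is not at the start of the string, so it is still in the result (shown here as a space):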

    Random Name 10504 Random Address; Overland Park, KS 12345; Cell: 000-000-0000 Email: email1234@gmail.com
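
    If only a fixed set of leading characters needs to go, a regex is not strictly required. As a minimal sketch, assuming the set \a, \r, \n is enough for your input (the shortened sample string is only for illustration), String.TrimStart does the same job:

    ```csharp
    using System;

    string input = "\a\a\r\r\r\r\r\aRandom Name\r10504 Random Address";

    // TrimStart removes every leading occurrence of the listed characters
    // and stops at the first character that is not in the set.
    string trimmed = input.TrimStart('\a', '\r', '\n');

    Console.WriteLine(trimmed.StartsWith("Random Name", StringComparison.Ordinal));  // True
    Console.WriteLine(trimmed.IndexOf('\r') >= 0);  // True: the inner \r is kept
    ```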
    

    To remove characters at any position, remove the start anchor ^.
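
    A short sketch of the unanchored variant, using a shortened sample string for illustration. Replacing with a space instead of an empty string is my own suggestion rather than part of the original pattern; it keeps words apart when a \r sits between them.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    string input = "\a\a\r\rRandom Name\r10504 Random Address";

    // No ^ anchor: every run of \r, \n or BELL is removed, wherever it occurs.
    string removed = Regex.Replace(input, @"[\r\n\a]+", "");
    Console.WriteLine(removed);   // Random Name10504 Random Address

    // Replacing each run with a single space keeps "Name" and "10504" separated;
    // Trim() then drops the space left over from the leading run.
    string spaced = Regex.Replace(input, @"[\r\n\a]+", " ").Trim();
    Console.WriteLine(spaced);    // Random Name 10504 Random Address
    ```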

    It's also possible to remove all control characters. Unicode has a specific category for them, which a regular expression can match with \p{Cc}; Cc is the shorthand for the control character category.

    var pattern = @"\p{Cc}+";
    var newText = Regex.Replace(input, pattern, "");
    

    As the docs explain, this category matches any

    Control code character, with a Unicode value of U+007F or in the range U+0000 through U+001F or U+0080 through U+009F. Signified by the Unicode designation "Cc" (other, control).
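
    Note that this range includes tab (U+0009) and the line break characters, so the unanchored pattern also removes those from the middle of the text. Since the question only asks about the start of the string, the category can be combined with the ^ anchor. A minimal sketch, with a made-up sample string containing a tab for illustration:

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    string input = "\a\a\r\r\tRandom Name\r10504\tRandom Address";

    // ^ limits the match to the start of the string; \p{Cc} matches any
    // control character, including tab and line breaks.
    string leadingOnly = Regex.Replace(input, @"^\p{Cc}+", "");
    Console.WriteLine(leadingOnly == "Random Name\r10504\tRandom Address");  // True

    // Without the anchor, the inner \r and \t are removed as well.
    string all = Regex.Replace(input, @"\p{Cc}+", "");
    Console.WriteLine(all);  // Random Name10504Random Address
    ```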