c# string parsing escaping

How to parse an escape sequence?


I'm writing a parser for my own markup and need to handle a few escape sequences, but I'm not sure which strategy to choose.

I have two in mind.

Here's an example: foo \\\<bar baz contains two escape sequences, \\ and \<.

When I scan the string character by character,

  1. should I detect the backslash \ and then check whether the next character is an escapable one, or
  2. should I check for the character and then look back to see whether it's preceded by a backslash \?

Are there any major (dis)advantages in either one?


Solution

  • You need to know where you are in the input, and the way to track that is a state machine. If you're only handling \r, \t, \n, \", and \\, you can get by with a very simple one, like this:

    using System;

    public static class StringExtensions
    {
        private enum UnescapeState
        {
            Unescaped,
            Escaped
        }
    
        public static String Unescape(this String s)
        {
            var sb = new System.Text.StringBuilder();
            UnescapeState state = UnescapeState.Unescaped;
    
            foreach (var ch in s)
            {
                switch (state)
                {
                    // The previous character was a backslash: translate this one.
                    case UnescapeState.Escaped:
                        switch (ch)
                        {
                            case 't':
                                sb.Append('\t');
                                break;
                            case 'n':
                                sb.Append('\n');
                                break;
                            case 'r':
                                sb.Append('\r');
                                break;
                            
                            case '\\':
                            case '\"':
                                sb.Append(ch);
                                break;
    
                            default:
                                throw new FormatException("Unrecognized escape sequence '\\" + ch + "'");
    
                        }
                        state = UnescapeState.Unescaped;
                        break;
    
                    // Ordinary text: a backslash starts an escape, anything else is copied.
                    case UnescapeState.Unescaped:
                        if (ch == '\\')
                        {
                            state = UnescapeState.Escaped;
                        }
                        else
                        {
                            sb.Append(ch);
                        }
                        break;
                }
            }
    
            // The input ended right after a backslash.
            if (state == UnescapeState.Escaped)
            {
                throw new FormatException("Unterminated escape sequence");
            }
    
            return sb.ToString();
        }
    }
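
For the markup in the question, the same forward-scanning idea could be adapted roughly as follows. This is only a sketch under assumptions: the escapable set is taken to be just \\ and \< (from the question's example), and the names MarkupEscapes, Escapable and IsEscapedAt are made up for illustration. It also shows what the look-back strategy has to do to be correct: checking only the one preceding character is not enough, because in \\< the < follows a backslash but is not escaped, so the whole run of preceding backslashes has to be counted.

```csharp
using System;
using System.Text;

public static class MarkupEscapes
{
    // Assumed set of escapable characters: backslash and '<'.
    private const string Escapable = "\\<";

    // Strategy 1: scan forward and remember whether a backslash is pending.
    public static string Unescape(string input)
    {
        var sb = new StringBuilder(input.Length);
        bool escaped = false;

        foreach (char ch in input)
        {
            if (escaped)
            {
                // The character after a backslash must be escapable.
                if (Escapable.IndexOf(ch) < 0)
                {
                    throw new FormatException("Unrecognized escape sequence '\\" + ch + "'");
                }
                sb.Append(ch);
                escaped = false;
            }
            else if (ch == '\\')
            {
                escaped = true;
            }
            else
            {
                sb.Append(ch);
            }
        }

        if (escaped)
        {
            throw new FormatException("Unterminated escape sequence");
        }

        return sb.ToString();
    }

    // Strategy 2: look back from a character. It is escaped only if it is
    // preceded by an odd number of consecutive backslashes.
    public static bool IsEscapedAt(string input, int index)
    {
        int backslashes = 0;
        for (int i = index - 1; i >= 0 && input[i] == '\\'; i--)
        {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }
}
```

With the example from the question, MarkupEscapes.Unescape(@"foo \\\<bar baz") yields foo \<bar baz. The forward scan touches each character once, while the look-back check may rescan a run of backslashes for every candidate character, which is one reason the state-machine approach tends to be simpler.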