I'm writing a parser for my own markup and I need to handle a few escape sequences, but I'm not sure which strategy I should choose.
In particular, I have two in mind.
Here's an example: foo \\\<bar baz, with two of them: \\ and \<.
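When I scan the string char by char, should I look for \ and then check whether the next character is an escapable one, or look for an escapable character and then check whether the previous character is \? Are there any major (dis)advantages in either one?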
You need to know where you're at. The way to do that is a state machine. If you're only doing \r, \t, \n, \", and \\, you can get by with a very simple one. Like this:
using System;

public static class StringExtensions
{
    // The scanner is either reading ordinary text or has just seen a backslash.
    private enum UnescapeState
    {
        Unescaped,
        Escaped
    }

    public static String Unescape(this String s)
    {
        var sb = new System.Text.StringBuilder();
        UnescapeState state = UnescapeState.Unescaped;

        foreach (var ch in s)
        {
            switch (state)
            {
                case UnescapeState.Escaped:
                    // The previous character was a backslash: translate this one.
                    switch (ch)
                    {
                        case 't':
                            sb.Append('\t');
                            break;
                        case 'n':
                            sb.Append('\n');
                            break;
                        case 'r':
                            sb.Append('\r');
                            break;
                        case '\\':
                        case '\"':
                            // Escaped backslash or quote: emit the character itself.
                            sb.Append(ch);
                            break;
                        default:
                            throw new Exception("Unrecognized escape sequence '\\" + ch + "'");
                    }
                    state = UnescapeState.Unescaped;
                    break;

                case UnescapeState.Unescaped:
                    if (ch == '\\')
                    {
                        // Start of an escape sequence: emit nothing yet.
                        state = UnescapeState.Escaped;
                    }
                    else
                    {
                        sb.Append(ch);
                    }
                    break;
            }
        }

        // A trailing backslash leaves the machine in the Escaped state.
        if (state == UnescapeState.Escaped)
        {
            throw new Exception("Unterminated escape sequence");
        }

        return sb.ToString();
    }
}
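
For the two escapes in the question (\\ and \<), the same two-state scan can be sketched with the set of escapable characters passed in. This is only a minimal sketch under my own assumptions: the MarkupEscapes class and the escapable parameter are names I made up for illustration, and I'm assuming an escaped character should simply be emitted literally.

```csharp
using System;
using System.Text;

public static class MarkupEscapes
{
    // Hypothetical helper: 'escapable' lists the characters allowed after a backslash.
    public static string Unescape(string s, string escapable)
    {
        var sb = new StringBuilder();
        bool escaped = false; // true right after an unconsumed backslash

        foreach (char ch in s)
        {
            if (escaped)
            {
                if (escapable.IndexOf(ch) < 0)
                {
                    throw new FormatException("Unrecognized escape sequence '\\" + ch + "'");
                }
                sb.Append(ch); // emit the escaped character literally
                escaped = false;
            }
            else if (ch == '\\')
            {
                escaped = true; // start of an escape sequence, emit nothing yet
            }
            else
            {
                sb.Append(ch);
            }
        }

        if (escaped)
        {
            throw new FormatException("Unterminated escape sequence");
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Verbatim string, so the input is literally: foo \\\<bar baz
        Console.WriteLine(MarkupEscapes.Unescape(@"foo \\\<bar baz", "\\<"));
        // prints: foo \<bar baz
    }
}
```

Because the state is consumed as soon as the escaped character is read, the run \\\< splits into \\ followed by \< without any look-behind. If your parser has to tell an escaped < apart from one that opens a tag, you would probably emit tokens instead of a flat string, but the scan itself stays the same.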