I am using the identifier parser from FParsec to parse the names of variables and functions, which are normally a mixture of Unicode and ASCII characters. But sometimes there are escaped Unicode characters at the beginning (like \u03C0) or within the identifier (like swipe_board\u003A_b). I can still make them parseable using the isAsciiIdStart and isAsciiIdContinue options, but I can't define my own custom function for pre-processing before normalization. What could be a solution here?
The identifier parser internally first parses a string and then passes it to an IdentifierValidator instance for validation. Since the C# IdentifierValidator class is publicly accessible (though not documented), you could easily adapt the identifier parser to your needs (by making the initial string parsing step also recognize the escapes).
The identifier parsing is a bit complicated due to support for UTF-16 surrogate pairs, normalization and the Unicode XID character category, which is not natively supported on .NET.
Maybe you only need to support ASCII or UCS-2 identifiers specified in terms of the character categories supported by CharUnicodeInfo.GetUnicodeCategory, in which case you could probably implement the parsing and validation in just one step using many1Satisfy2 or many1Chars2.
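For the simpler one-step approach, something along these lines might work. This is only a rough sketch under a few assumptions: the escapes always have the form \u followed by exactly four hex digits, any escaped character is accepted in an identifier (so \u003A is allowed even though a literal colon is not), and no normalization or surrogate-pair handling is needed. The names escapedChar, isIdStart, isIdContinue and escapedIdentifier are made up for illustration and are not part of FParsec, and the category list is one possible choice, not the XID definition.

```fsharp
open System
open System.Globalization
open FParsec

// Hypothetical helper: parses \uXXXX (exactly four hex digits) and
// returns the decoded UTF-16 code unit.
let escapedChar : Parser<char, unit> =
    pstring "\\u" >>. manyMinMaxSatisfy 4 4 isHex
    |>> fun (hex: string) -> char (Convert.ToInt32(hex, 16))

// Assumed rule for the first character: a letter or an underscore.
let isIdStart (c: char) =
    c = '_' || Char.IsLetter c

// Assumed rule for the following characters: a start character, a decimal
// digit, a combining mark or connector punctuation.
let isIdContinue (c: char) =
    isIdStart c
    || (match CharUnicodeInfo.GetUnicodeCategory c with
        | UnicodeCategory.DecimalDigitNumber
        | UnicodeCategory.NonSpacingMark
        | UnicodeCategory.SpacingCombiningMark
        | UnicodeCategory.ConnectorPunctuation -> true
        | _ -> false)

// An escape is accepted wherever a literal identifier character is.
let idStart : Parser<char, unit> = satisfy isIdStart <|> escapedChar
let idContinue : Parser<char, unit> = satisfy isIdContinue <|> escapedChar

// Parsing and validation in one step, as suggested above.
let escapedIdentifier : Parser<string, unit> =
    many1Chars2 idStart idContinue <?> "identifier"

// Usage: the escape is replaced by the character it encodes.
match run escapedIdentifier "swipe_board\\u003A_b" with
| Success (name, _, _) -> printfn "parsed: %s" name
| Failure (msg, _, _) -> printfn "error: %s" msg
```

many1Chars2 is used here instead of many1Satisfy2 because the escape spans several input characters, so it has to be a char parser and cannot be a plain char -> bool predicate. If you would rather restrict which escaped characters are allowed, you could check the decoded character against isIdStart or isIdContinue before accepting it.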