parsing, f#, fparsec

Using preprocessing function with identifier parser in FParsec?


I am using the identifier parser from FParsec to parse the names of variables and functions, which are normally a mixture of Unicode and ASCII characters. Sometimes, however, a name contains escaped Unicode characters, either at the beginning (like \u03C0) or in the middle (like swipe_board\u003A_b). I can still make such names parseable with the isAsciiIdStart and isAsciiIdContinue options, but I can't supply my own function to pre-process the string before normalization. What could be a solution here?


Solution

  • The identifier parser internally first parses a string and then passes it to an IdentifierValidator instance for validation. Since the C# IdentifierValidator class is publicly accessible (though undocumented), you could adapt the identifier parser to your needs by making the initial string-parsing step recognize the escapes as well.

    Identifier parsing is somewhat complicated because it has to support UTF-16 surrogate pairs, normalization and the Unicode XID character category, which is not natively supported on .NET. If you only need ASCII or UCS-2 identifiers that can be specified in terms of the character categories reported by CharUnicodeInfo.GetUnicodeCategory, you could probably implement parsing and validation in a single step using many1Satisfy2 or many1Chars2.
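
A minimal sketch of that single-step approach, using many1Chars2 so that a \uXXXX escape can stand in for one character. The names escapedChar, isIdStart, isIdContinue and identifier are hypothetical, and the character categories chosen here are only an assumption about what a valid identifier should look like. It handles UCS-2 only (no surrogate pairs), does no normalization, and accepts any decoded escape without further validation:

```fsharp
#r "nuget: FParsec"   // only needed when running this as an F# script

open System
open System.Globalization
open FParsec

// Parses an escape of the form \uXXXX (exactly four hex digits)
// and returns the single UTF-16 code unit it denotes.
let escapedChar : Parser<char, unit> =
    pstring "\\u" >>. manyMinMaxSatisfy 4 4 isHex
    |>> fun digits -> char (Int32.Parse(digits, NumberStyles.HexNumber))

// Assumed rule: an identifier starts with '_' or a letter category.
let isIdStart (c: char) =
    c = '_' ||
    match CharUnicodeInfo.GetUnicodeCategory c with
    | UnicodeCategory.UppercaseLetter
    | UnicodeCategory.LowercaseLetter
    | UnicodeCategory.TitlecaseLetter
    | UnicodeCategory.ModifierLetter
    | UnicodeCategory.OtherLetter
    | UnicodeCategory.LetterNumber -> true
    | _ -> false

// Assumed rule: later characters may also be digits, marks or connectors.
let isIdContinue (c: char) =
    isIdStart c ||
    match CharUnicodeInfo.GetUnicodeCategory c with
    | UnicodeCategory.DecimalDigitNumber
    | UnicodeCategory.NonSpacingMark
    | UnicodeCategory.SpacingCombiningMark
    | UnicodeCategory.ConnectorPunctuation -> true
    | _ -> false

// An escape is accepted anywhere a literal identifier character is.
let identifier : Parser<string, unit> =
    let start = escapedChar <|> satisfy isIdStart
    let cont  = escapedChar <|> satisfy isIdContinue
    many1Chars2 start cont <?> "identifier"

// Verbatim strings keep the backslash as a literal character.
for input in [ @"\u03C0"; @"swipe_board\u003A_b" ] do
    match run identifier input with
    | Success (name, _, _) -> printfn "%s -> %s" input name
    | Failure (msg, _, _)  -> printfn "%s failed: %s" input msg
```

With these rules, swipe_board\u003A_b should come out as swipe_board:_b. Note that a malformed escape such as \u12 makes the whole identifier parser fail rather than stop before the backslash, because pstring has already consumed input at that point; wrapping escapedChar in attempt would change that. If escaped characters must obey the same category rules as literal ones, the decoded character could be checked against isIdStart or isIdContinue before it is accepted.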