
How to decode Java strings with Unicode escapes etc. from Scala JavaTokenParsers into unescaped strings?


JavaTokenParsers in Scala provides convenient regexps for matching integer and floating-point numbers, and double-quoted strings. But that's all it does. How do I do the obvious thing of converting these matched strings back into the underlying objects they represent? This is pretty easy for numbers, using toDouble or toInt, etc. But how do you do the equivalent for strings? For example, if I type the string

"Unicode \u20ac is a Euro sign, which I would write \\u20ac in a string. \243 is a pound sign.\n\r And \f is a \"form feed\", with embedded quotes.\n\r"

and then run this through JavaTokenParsers, I'll duly get a string back in which the embedded quotes are parsed correctly, but which still has a double-quote character as its first and last characters and lots of unprocessed backslash sequences. How do I get the equivalent Java string with the escape sequences processed? I can't believe there's no library function to do this, but I can't find one.


Solution

  • There seems to be no such function; at least, the Scala compiler does not use one. That is not a conclusive answer, though: a library function may have been introduced since.

    In case you want to read (or copy-n-paste) the code, here is the related code I found. The tokenization logic of the Scala compiler is spread across several files. The top-level method seems to be fetchToken in src/compiler/scala/tools/nsc/ast/parser/Scanners.scala, which in turn delegates to logic in src/compiler/scala/tools/nsc/util/CharArrayReader.scala (one of its ancestors), in particular nextChar and potentialUnicode. Other escapes are handled in getLitChar, again in Scanners.scala.
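
If you would rather write the conversion yourself than copy the compiler's code, the following is a minimal sketch of one way to do it. It is not taken from the Scala compiler or from any library: the object name JavaStringUnescaper and the method unescape are hypothetical, chosen for illustration. It assumes the input is the full matched literal, including the surrounding double quotes, and decodes everything in a single left-to-right pass. That differs slightly from the Java compiler, which translates Unicode escapes before it tokenizes, so unusual inputs such as \u005c followed by n are handled differently here.

```scala
object JavaStringUnescaper {

  /** Hypothetical helper: strips the surrounding double quotes from a matched
    * string literal and decodes its escape sequences in one pass.
    */
  def unescape(literal: String): String = {
    require(
      literal.length >= 2 && literal.head == '"' && literal.last == '"',
      "expected a double-quoted string literal")

    val s = literal.substring(1, literal.length - 1)
    val out = new StringBuilder(s.length)
    var i = 0

    while (i < s.length) {
      val c = s.charAt(i)
      if (c != '\\') {
        // Ordinary character: copy it through unchanged.
        out.append(c)
        i += 1
      } else {
        if (i + 1 >= s.length)
          throw new IllegalArgumentException("dangling backslash at end of literal")
        s.charAt(i + 1) match {
          case 'b'  => out.append('\b'); i += 2
          case 't'  => out.append('\t'); i += 2
          case 'n'  => out.append('\n'); i += 2
          case 'f'  => out.append('\f'); i += 2
          case 'r'  => out.append('\r'); i += 2
          case '"'  => out.append('"'); i += 2
          case '\'' => out.append('\''); i += 2
          case '\\' => out.append('\\'); i += 2
          case 'u' =>
            // Unicode escape: one or more 'u's followed by exactly four hex digits.
            var j = i + 1
            while (j < s.length && s.charAt(j) == 'u') j += 1
            if (j + 4 > s.length)
              throw new IllegalArgumentException("truncated unicode escape")
            val hex = s.substring(j, j + 4)
            if (!hex.forall(ch => Character.digit(ch, 16) >= 0))
              throw new IllegalArgumentException("invalid unicode escape: " + hex)
            out.append(Integer.parseInt(hex, 16).toChar)
            i = j + 4
          case d if d >= '0' && d <= '7' =>
            // Octal escape: up to three digits if the first is 0-3, else up to two.
            val maxLen = if (d <= '3') 3 else 2
            var j = i + 1
            while (j < s.length && j - (i + 1) < maxLen &&
                   s.charAt(j) >= '0' && s.charAt(j) <= '7') j += 1
            out.append(Integer.parseInt(s.substring(i + 1, j), 8).toChar)
            i = j
          case other =>
            throw new IllegalArgumentException("invalid escape: \\" + other)
        }
      }
    }
    out.toString
  }
}
```

Inside a parser that extends JavaTokenParsers, this could presumably be attached to the string rule with something like stringLiteral ^^ JavaStringUnescaper.unescape. Whether stringLiteral accepts every escape in the example above (octal escapes in particular) depends on the regexp used by your version of the parser-combinators library, so check that before relying on it.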