
UTF-8 match position


Is it somehow possible to get the character position of a matched pattern in Ragel?

I know a match receives a pointer into the string (char *), i.e. the byte offset where the pattern was found inside the string. The problem is that UTF-8 is a variable-length encoding, so character positions and byte offsets do not have to align.

For example, if I wanted to search for $ in €€$, I would like to get 2 instead of 6 ($ → 0x24, € → 0xE282AC).


Solution

  • Ragel generates a tight piece of source code which is embedded into your favorite language. This code doesn't use any libraries, neither ones provided by Ragel nor the language's standard library. As such, it has no means to parse UTF-8 or calculate the length of a UTF-8 string.

    What it can do, though, is give you pointers into the portion of the string you're interested in. Given that, you can calculate its UTF-8 length using your favorite language-specific tools. For example, in C++ you could use the cxxtools' Utf8Codec::do_length method (or any other library you can think of) to get the UTF-8 length of the €€ piece after the Ragel code returns it to you.

    You can also tune Ragel to use 16-bit characters and feed UCS-2 to it, as discussed by Wil Macaulay and Wincent Colaiuta. 32-bit characters with UCS-4 should be even better, since every code point then occupies exactly one unit.

    Yet another angle could be to generate a state machine handling UTF-8 using the unicode2ragel.rb script and attempt to modify it to count the number of transitions. (I have no idea whether that will work; I have never used that state machine myself.)
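
The pointer-based approach can also be sketched without any library. The helper below is hypothetical (the name utf8_char_offset is not part of Ragel or cxxtools) and assumes the input is valid UTF-8 with both pointers on code point boundaries. It counts every byte that is not a continuation byte (10xxxxxx), which gives the number of code points before the match:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of Ragel: returns the number of UTF-8 code
// points in [begin, match). Assumes valid UTF-8 and that both pointers sit
// on code point boundaries.
std::size_t utf8_char_offset(const char *begin, const char *match) {
    assert(begin <= match);
    std::size_t count = 0;
    for (const char *p = begin; p < match; ++p) {
        // Continuation bytes look like 10xxxxxx; any other byte starts
        // a new code point.
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Given the start of the buffer and the match pointer that the Ragel-generated code hands back, calling utf8_char_offset(start, match_ptr) would turn the byte offset into a character offset. Note that this counts code points, not user-perceived characters, so combining sequences count as more than one.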