
Python and PCRE regex that are the same give different outputs for the same input


I am trying to implement the minbpe library in Zig, using a wrapper over the PCRE library.

The pattern in Python is r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

When I use the pattern with a UTF-8 encoded text like abcdeparallel १२४, I get the following output:

>>> import regex as re
>>> p = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> p
regex.Regex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
>>> p.findall("abcdeparallel १२४")
['abcdeparallel', ' १२४']

It looks like this is more or less the same in PCRE-flavored regex as well; I just have to add a /g flag at the end for UTF-8 matching.

However, when I try to use the pattern with PCRE via the pcre2test tool on macOS, I get a very different output:

$ pcre2test -8
PCRE2 version 10.42 2022-12-11
  re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \xe0
 0: \xa5\xa7
 0: \xe0
 0: \xa5\xa8
 0: \xe0
 0: \xa5
 0: \xaa

Somehow it looks like the code points for the Hindi numerals (1, 2, 4) are interpreted differently, and the output is matched as a totally different set of characters:

>>> b"\xe0\xa5\xa7\xe0\xa5\xa8".decode("utf-8")
'१२'

Is there a flag or something I am missing that must be passed to get the same behaviour as the regex package from Python? When UTF-8 code points are encoded into bytes, wouldn't the library know how to put them back together into the same code points?


Solution

  • The Hindi digits are actually matched, but byte by byte rather than as whole code points, and pcre2test renders those bytes on screen as UTF-8 hex escapes:

    >>> "१२४".encode("utf-8")
    b'\xe0\xa5\xa7\xe0\xa5\xa8\xe0\xa5\xaa'
    

    According to the pcre2test documentation:

    When pcre2test is outputting text in the compiled version of a pattern, bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes.

    When pcre2test is outputting text that is a matched part of a subject string, it behaves in the same way, unless a different locale has been set for the pattern (using the locale modifier). In this case, the isprint() function is used to distinguish printing and non-printing characters.

    The documentation doesn't mention which locales can be used. The example (fr_FR) suggests a two-letter language code followed by a two-letter country code, but it's unclear to me whether Hindi is supported.

    With the (*UTF) flag at the start of the pattern (equivalent to compiling with the PCRE2_UTF option) you do get two matches, and the Hindi numerals are then rendered as Unicode code point escapes:

    re> /(*UTF)'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
    data> abcdeparallel १२४
     0: abcdeparallel
     0:  \x{967}\x{968}\x{96a}
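
The two pcre2test outputs can be reproduced roughly in Python. This is a minimal sketch that assumes PCRE2, without UTF mode, treats every byte as a separate character with a code point from 0 to 255 and looks up its Unicode category. The helper names latin1_class and byte_runs are made up for illustration, and Python's unicodedata tables stand in for PCRE2's own.

```python
import unicodedata
from itertools import groupby


def latin1_class(byte):
    """Classify one byte as if it were a code point in the range 0-255."""
    ch = chr(byte)
    if ch.isspace():
        return "space"
    major = unicodedata.category(ch)[0]  # 'L' = letter, 'N' = number, ...
    if major == "L":
        return "letter"
    if major == "N":
        return "number"
    return "other"


def byte_runs(text):
    """Group the UTF-8 bytes of `text` into runs that share the same class."""
    data = text.encode("utf-8")
    return [(cls, bytes(run)) for cls, run in groupby(data, key=latin1_class)]


# Byte-wise view: \xe0 and \xaa look like letters, \xa5 \xa7 \xa8 do not,
# so the three digits break up into alternating letter / other runs.
for cls, run in byte_runs("१२४"):
    print(cls, run)

# Code point view: the \x{...} escapes printed in UTF mode are the digits themselves.
decoded = "".join(chr(cp) for cp in (0x967, 0x968, 0x96A))
print(decoded == "१२४")
```

The runs line up with the seven match lines in the question's output (the leading space is picked up by the optional " ?" in the pattern), which suggests the split comes from byte-wise matching rather than from a different interpretation of the digits.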