python regex unicode

Regex \b - Devanagari


I have a regex in Python which uses \b to match at word boundaries. When I use it on Devanagari text, I notice that not all characters in the Unicode block are treated as word characters. Certain marks, such as dependent vowel signs, appear to be treated as non-word characters. This is a problem, because words in this script can end with these characters.

Is it possible to tell the regex engine to treat the entire block from U+0900 to U+097F as word characters?

See for example the following regex.

'(?<!\.)(a(?:bc|de)|zip|चाय|पानी)\b'

Here, the first four words abc, ade, zip and चाय are detected at proper word boundaries. The word पानी, however, ends with the vowel sign ी, and the regex does not see a word boundary after it, although ideally it should.

>>> import re
>>> re.findall(r"(?<!\.)(a(?:bc|de)|zip|चाय|पानी)\b", 'This is abc, ade, चाय, पानी  and abca')
['abc', 'ade', 'चाय']

Can I change this behavior, and if so, how?


Solution

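  • The problem with the pattern is that \b treats U+093E (DEVANAGARI VOWEL SIGN AA) and U+0940 (DEVANAGARI VOWEL SIGN II) as non-word characters, so the boundaries in the word पानी fall on either side of each consonant rather than at the end of the word. In particular, the final ी and the space that follows it are both non-word characters, so there is no boundary between them and the trailing \b fails.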

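    When working with Devanagari text in Python regular expressions, it is important to understand that the re module's definitions of \w and \b differ from Unicode's. The re module treats a character as a word character only if it is alphanumeric (as defined by str.isalnum()) or an underscore, which excludes combining marks such as dependent vowel signs.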

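    The easiest fix is to use the third-party regex module instead (installable with pip install regex). Its \w and \b follow Unicode's definition of word characters more closely than the re module's do, so dependent vowel signs count as part of the word.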

    import regex as re
    re.findall(r"(?<!\.)(a(?:bc|de)|zip|चाय|पानी)\b", 'This is abc, ade, चाय, पानी  and abca')
    # ['abc', 'ade', 'चाय', 'पानी']
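
If installing a third-party module is not an option, the following is a minimal sketch of a workaround using only the standard re module. It is not part of the original answer. It first lists the characters of a string that re's \w rejects, then replaces the trailing \b with a negative lookahead over a custom character class. The names WORD, non_word_chars and find_words are made up for this illustration. The class takes the question literally and treats the whole block U+0900 to U+097F as word characters. That is an assumption you may want to narrow, since the block also contains punctuation such as the danda (U+0964) and double danda (U+0965).

```python
import re
import unicodedata

# Assumption (from the question): every code point in the Devanagari block
# U+0900-U+097F counts as a word character, plus whatever re's \w matches.
# Note that this also covers the danda (U+0964) and double danda (U+0965).
WORD = r"[\w\u0900-\u097F]"

# Same alternatives as the original pattern, but the trailing \b is replaced
# by "not followed by a word character" using the custom class above.
PATTERN = re.compile(rf"(?<!\.)(a(?:bc|de)|zip|चाय|पानी)(?!{WORD})")


def non_word_chars(text):
    """Return (code point, Unicode category) for each char that re's \\w rejects."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in text
        if not re.fullmatch(r"\w", ch)
    ]


def find_words(text):
    """Find the listed words when not followed by a (custom) word character."""
    return PATTERN.findall(text)


print(non_word_chars("पानी"))
# [('U+093E', 'Mc'), ('U+0940', 'Mc')]

print(find_words("This is abc, ade, चाय, पानी  and abca"))
# ['abc', 'ade', 'चाय', 'पानी']
```

The lookahead does the job of \b here because every alternative ends in a character that should count as part of the word, so "boundary after the word" reduces to "next character is not a word character". As in the original pattern, nothing is checked at the start of the word. A matching negative lookbehind such as (?<![\w\u0900-\u097F]) could be added in front if that is needed.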