regex pcre cyrillic

RegEx. \b for Cyrillic symbols


What can I use instead of \b to match whole words in Cyrillic text?

I have the text "текст" in a SQLite database column.

This works:

select * from myTable where text REGEXP 'текст'

This does not work:

select * from myTable where text REGEXP '\bтекст\b'

Solution

  • It turns out your SQLite REGEXP implementation is based on PCRE.

    You can make \b Unicode-aware by using the (*UCP) PCRE verb:

    '(*UCP)\bтекст\b'
    

    There are some details about the verb in the pcrepattern man page:

    Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.

    And later:

    Note also that PCRE_UCP affects \b, and \B because they are defined in terms of \w and \W. Matching these sequences is noticeably slower when PCRE_UCP is set.

    It is slower because the engine now has to consult Unicode property data instead of a small lookup table.
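
To see why \b fails without Unicode-aware \w, here is a minimal sketch in Python. It is an analogy, not the PCRE setup from the question: Python's re module is not PCRE and does not accept (*UCP). The re.ASCII flag stands in for PCRE's default ASCII-only \w, and Python's default Unicode matching stands in for (*UCP). The sample rows and the helper names make_regexp and matching_rows are made up for illustration.

```python
import re
import sqlite3


def make_regexp(flags=0):
    """Build a REGEXP callback. SQLite calls it as regexp(pattern, value)."""
    def regexp(pattern, value):
        if value is None:
            return 0
        return int(re.search(pattern, value, flags) is not None)
    return regexp


def matching_rows(flags):
    """Run the question's query against sample rows with the given re flags."""
    conn = sqlite3.connect(":memory:")
    # SQLite has no built-in regexp(); "X REGEXP Y" calls the user function.
    conn.create_function("REGEXP", 2, make_regexp(flags))
    conn.execute("CREATE TABLE myTable (text TEXT)")
    conn.executemany(
        "INSERT INTO myTable VALUES (?)",
        [("это текст",), ("контекст",)],  # hypothetical sample data
    )
    rows = conn.execute(
        "SELECT text FROM myTable WHERE text REGEXP ?",
        (r"\bтекст\b",),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# ASCII-only \w: Cyrillic letters are not word characters, so \b never matches.
ascii_rows = matching_rows(re.ASCII)

# Unicode-aware \w: the whole word matches, the substring in "контекст" does not.
unicode_rows = matching_rows(0)

print(ascii_rows)
print(unicode_rows)
```

With ASCII-only \w the query returns nothing, because no character in the text counts as a word character and so no word boundary exists. With Unicode-aware \w only the row containing "текст" as a separate word is returned.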