Tags: unicode, lua, cjk, lua-patterns

Regex for one optional space before Chinese words in Lua


I tried string.match("Í",'%s?[\u{4e00}-\u{9FFF}]+'), which is similar to how this is done in JS and other languages. But it also matches unwanted characters, such as the 'Í' above.

The official approach to matching UTF-8 uses \ddd escapes, but \u{XXX} seems to fail because

Lua's pattern matching facilities work byte by byte

For now, I use an unstable workaround similar to utf8.charpattern, based on the UTF-8 encoding: string.match("Í",'%s?[\228-\233][%z\1-\191][%z\1-\191]'). It returns nil here and works for checking CJK characters like '我', although the range for the 2nd byte from the left is wrong.

Q:

How can I solve this problem with a regex?


Solution

    1. Lua patterns are not regular expressions. Regular expressions have features that Lua patterns don't have (e.g. quantifiers on groups, possibly nested, and alternation), and Lua patterns have features that regular expressions (at least in the formal linguistic sense) do not have (e.g. %b, %1).
    2. You are right: Lua patterns do not operate on "code points", they operate on bytes. That's why \u{4e00}-\u{9FFF} doesn't work: what Lua sees here is the byte sequence \228\184\128-\233\191\191, which inside a character class is equivalent to \184\191\228\128-\233. That is very different from what you want (notably, the range now runs from \128 to \233). I consider the interaction of - with multibyte "characters" that appear as a single code point in the source a bit of a footgun.
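The byte-level view in point 2 can be checked directly. The following is a minimal sketch, assuming Lua 5.3 or later (for the \u{...} escape); the helper name bytes is made up for illustration.

```lua
-- Needs Lua 5.3+ for the \u{...} escape.
-- Hypothetical helper: list the bytes of a string as decimal numbers.
local function bytes(str)
    return table.concat({str:byte(1, -1)}, " ")
end

print(bytes("\u{4e00}"))  --> 228 184 128
print(bytes("\u{9FFF}"))  --> 233 191 191

-- "Í" is U+00CD, stored as the two bytes 195 141.
local s = "\u{CD}"
print(bytes(s))  --> 195 141

-- Both bytes fall inside \128-\233, so the class matches each of them
-- and the whole two-byte string is returned.
print(s:match('%s?[\u{4e00}-\u{9FFF}]+') == s)  --> true
```

This is why the pattern from the question accepts 'Í': the class ends up covering most non-ASCII bytes rather than a range of code points.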

    Since you want a pure Lua solution, and given the simplicity of your pattern, a handmade solution is feasible:

    -- s is the subject string; utf8.codes (Lua 5.3+) expects valid UTF-8
    local codepoints = {}
    for _, c in utf8.codes(s) do
        if utf8.char(c):match"^%s$" and codepoints[1] == nil then
            codepoints[1] = c -- optional leading space
        elseif c >= 0x4E00 and c <= 0x9FFF then
            table.insert(codepoints, c) -- CJK ideograph
        else
            codepoints = {} -- any other character discards the candidate
        end
    end
    local match = utf8.char(table.unpack(codepoints))
    if match:match"^%s?$" then match = nil end -- single space or empty string
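Note that the loop above starts over on any non-matching character, so it ends up holding only a run that reaches the end of the string. If the first run is wanted instead, closer to what string.match would return, a variant could track byte offsets and stop once the run ends. This is only a sketch: the name find_cjk_run is hypothetical, and it assumes valid UTF-8 input and ASCII whitespace.

```lua
-- Hypothetical helper: return the first run of code points in
-- U+4E00..U+9FFF, including one directly preceding whitespace
-- character if there is one. Returns nil when there is no such run.
local function find_cjk_run(s)
    local start, stop   -- byte offsets of the current match
    local prev_space    -- byte offset of a directly preceding space, if any
    for p, c in utf8.codes(s) do
        if c >= 0x4E00 and c <= 0x9FFF then
            start = start or prev_space or p
            stop = p + 2   -- code points in this range take 3 bytes in UTF-8
            prev_space = nil
        elseif start then
            break          -- the run has ended
        elseif c < 128 and string.char(c):match("%s") then
            prev_space = p
        else
            prev_space = nil
        end
    end
    return start and s:sub(start, stop) or nil
end

print(find_cjk_run("abc 我们 def"))  --> " 我们" (with the leading space)
print(find_cjk_run("Í"))            --> nil
print(find_cjk_run("我a们"))         --> 我
```

Slicing the original string with s:sub avoids rebuilding the match from code points.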
    

    Edit: Since you want to check for a full match, this can be simplified:

    local match = true
    local got_chinese_character = false
    for p, c in utf8.codes(s) do
        if c >= 0x4e00 and c <= 0x9FFF then
            got_chinese_character = true
        elseif p > 1 or not utf8.char(c):match"^%s$" then
            -- non-chinese character that is not a leading space
            match = false
            break
        end
    end
    match = match and got_chinese_character
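As a byte-level alternative closer to the workaround in the question, the ranges can be made exact by splitting U+4E00..U+9FFF into two patterns and stepping through the string with position captures, since Lua patterns have no alternation. This is a sketch under that assumption, not a tested library routine; the function name is_cjk_word_bytes is hypothetical. It uses only \ddd escapes and does not need the utf8 library.

```lua
-- Exact UTF-8 byte ranges for U+4E00..U+9FFF:
--   E4     B8-BF  80-BF   (U+4E00..U+4FFF)
--   E5-E9  80-BF  80-BF   (U+5000..U+9FFF)
-- The trailing () captures the position just after the character.
local CJK_A = "^\228[\184-\191][\128-\191]()"
local CJK_B = "^[\229-\233][\128-\191][\128-\191]()"

-- Hypothetical helper: true if s is one optional whitespace character
-- followed by one or more characters in U+4E00..U+9FFF and nothing else.
local function is_cjk_word_bytes(s)
    local pos = s:match("^%s?()")        -- skip one optional leading space
    if pos > #s then return false end    -- empty string or a lone space
    while pos <= #s do
        -- '^' anchors each attempt at pos, so bytes are consumed in order
        local nxt = s:match(CJK_A, pos) or s:match(CJK_B, pos)
        if not nxt then return false end
        pos = nxt
    end
    return true
end

print(is_cjk_word_bytes("我"))      --> true
print(is_cjk_word_bytes(" 我们"))   --> true
print(is_cjk_word_bytes("Í"))      --> false
print(is_cjk_word_bytes("我 们"))   --> false
```

Because every attempt is anchored and must consume a complete three-byte sequence, malformed input such as "\228\1\1" is rejected rather than raising an error.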