I tried using string.match("Í", '%s?[\u{4e00}-\u{9FFF}]+'), which is similar to how this would be written in JS or other languages. But it matches an unwanted character, such as the 'Í' above.
The official way of matching UTF-8 uses \ddd escapes, but \u{XXX} seems to fail because "Lua's pattern matching facilities work byte by byte".
For now, I use a fragile workaround similar to utf8.charpattern: string.match("Í", '%s?[\228-\233][%z\1-\191][%z\1-\191]'). It is based on the UTF-8 encoding, returns nil for "Í", and works for checking CJK characters like '我', although the range for the 2nd byte from the left is wrong.
Q: How can I solve this problem with a regex?
%b
, %1
).
Inside a character class, \u{4e00}-\u{9FFF} doesn't work: what Lua sees here is \228\184\128-\233\191\191, equivalent to \184\191\228\128-\233, because - only forms a range between the single bytes on either side of it. That is very different from what you want (notably, the range is suddenly from \128 to \233). I consider the interaction of - with multibyte "characters" that appear as a single code point in the source a bit of a footgun.

Since you want a pure Lua solution, and given the simplicity of your pattern, a handmade solution is feasible:
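-- s is the subject string (utf8.codes expects valid UTF-8)
local codepoints = {}
for _, c in utf8.codes(s) do
    if utf8.char(c):match"^%s$" and codepoints[1] == nil then
        -- optional leading whitespace starts a candidate match
        codepoints[1] = c
    elseif c >= 0x4e00 and c <= 0x9FFF then
        -- code point in the range U+4E00 to U+9FFF
        table.insert(codepoints, c)
    else
        -- anything else discards the candidate collected so far
        codepoints = {}
    end
end
local match = utf8.char(table.unpack(codepoints))
if match:match"^%s?$" then match = nil end -- single space or empty string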
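Going back to the character-class problem described earlier, a quick way to confirm what Lua sees is to dump the bytes behind the two escapes. This is only a small sketch and assumes Lua 5.3 or later (needed for the \u{XXX} escape); the variable names are made up for illustration.

```lua
-- Sketch: inspect the bytes behind the escapes (assumes Lua 5.3+).
local lo, hi = "\u{4e00}", "\u{9FFF}"
print(string.byte(lo, 1, -1))  --> 228  184  128
print(string.byte(hi, 1, -1))  --> 233  191  191

-- "Í" is U+00CD, which UTF-8 encodes as the two bytes 195 and 141.
local subject = "\u{CD}"
print(string.byte(subject, 1, -1))  --> 195  141

-- Both bytes fall inside the accidental byte range \128-\233,
-- so the byte-wise class matches the whole character.
local m = subject:match("[\u{4e00}-\u{9FFF}]+")
print(m == subject)  --> true
```

Since 195 and 141 both lie between 128 and 233, the class accepts each byte of "Í" individually, which is exactly the unwanted match from the question.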
Edit: Since you want to check for a full match, this can be simplified:
local match = true
local got_chinese_character = false
for p, c in utf8.codes(s) do
    if c >= 0x4e00 and c <= 0x9FFF then
        got_chinese_character = true
    elseif p > 1 or not utf8.char(c):match"^%s$" then
        -- non-chinese character that is not a leading space
        match = false
        break
    end
end
match = match and got_chinese_character
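For convenience, the full-match check above could be wrapped in a function. The following is a minimal sketch, not a definitive implementation: the name is_cjk_with_optional_space is hypothetical, and it assumes Lua 5.3+ and valid UTF-8 input (utf8.codes raises an error on invalid byte sequences).

```lua
-- Hypothetical helper: true if s is an optional leading whitespace
-- character followed by one or more code points in U+4E00..U+9FFF.
local function is_cjk_with_optional_space(s)
    local got_chinese_character = false
    for p, c in utf8.codes(s) do
        if c >= 0x4E00 and c <= 0x9FFF then
            got_chinese_character = true
        elseif p > 1 or not utf8.char(c):match("^%s$") then
            -- non-chinese character that is not a leading space
            return false
        end
    end
    return got_chinese_character
end

print(is_cjk_with_optional_space("\u{6211}"))            --> true   ("我")
print(is_cjk_with_optional_space(" \u{6211}\u{4EEC}"))   --> true   (" 我们")
print(is_cjk_with_optional_space("\u{CD}"))              --> false  ("Í")
print(is_cjk_with_optional_space(" "))                   --> false  (space only)
print(is_cjk_with_optional_space("\u{6211} "))           --> false  (trailing space)
```

If the input may contain invalid UTF-8, wrapping the call in pcall is one way to turn the error into a plain false.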