I need to get the first char of a text variable. I achieve this with one of the following simple methods:
string.sub(someText,1,1)
or
someText:sub(1,1)
If I do the following, I expect to get 'ñ'
as the first letter. However, the result of either of the sub
methods is 'Ã'
local someText = 'ñññññññ'
print('Test whole: '..someText)
print('first char: '..someText:sub(1,1))
print('first char with .sub: '..string.sub(someText,1,1))
Here are the results from the console:
2014-03-02 09:08:47.959 Corona Simulator[1701:507] Test whole: ñññññññ
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char: Ã
2014-03-02 09:08:47.960 Corona Simulator[1701:507] first char with .sub: Ã
It seems like the string.sub()
function is encoding the returned value in UTF-8. Just for kicks I tried using the utf8_decode()
function that's provided by Corona SDK. It was not successful. The simulator indicated that the function expected a number but got nil
instead.
I also searched the web to see if anyone else had ran into this issue. I found out that there is a fair amount of discussion on Lua, Corona, Unicode and UTF-8 but I did not come across anything that would address this specific problem.
Lua strings are 8-bit clean, which means strings in Lua are a stream of bytes. The UTF-8 character ñ
has multiple bytes, but someText:sub(1,1)
returns only the first single byte.
For UTF-8 encoding, all characters in the ASCII range have the same representation as in ASCII, that is, a single byte smaller than 128. For other CodePoints, a sequences of bytes where the first byte is in the range 194-244 and continuation bytes are in the range 128-191.
Because of this, you can use the pattern ".[\128-\191]*"
to match a single UTF-8 CodePoint (not Grapheme):
for c in "ñññññññ":gmatch(".[\128-\191]*") do -- pretend the first string is in NFC
print(c)
end
Output:
ñ
ñ
ñ
ñ
ñ
ñ
ñ