First question. What's the easiest way in Lua to determine whether the last character in a string is multibyte? Or what's the easiest way to delete the last character from a string?
Here are examples of valid strings, and what I want the output of the function to be:
hello there --- result should be: hello ther
anñ --- result should be: an
כראע --- result should be: כרא
ㅎㄹㅇㅇㅅ --- result should be: ㅎㄹㅇㅇ
I need something like
function lastCharacter(string)
    --- some code which will extract the last character only ---
    return lastChar
end
or if it's easier
function deleteLastCharacter(string)
    --- some code which will output the string minus the last character ---
    return newString
end
This is the path I was going on
local function lastChar(string)
    local stringLength = string.len(string)
    local lastc = string.sub(string,stringLength,stringLength)
    if lastc is a multibyte character then
        local wordTable = {}
        for word in string:gmatch("[\33-\127\192-\255]+[\128-\191]*") do
            wordTable[#wordTable+1] = word
        end
        lastc = wordTable[#wordTable]
    end
    return lastc
end
First of all, note that there are no functions in Lua's string library that know anything about Unicode or multibyte encodings (source: Programming in Lua, 3rd edition). As far as Lua is concerned, strings are simply sequences of bytes. If you are using UTF-8 encoded strings, it's up to you to figure out which bytes make up a character. Therefore, string.len will give you the number of bytes, not the number of characters, and string.sub will give you a substring of bytes, not a substring of characters.
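To make the byte-versus-character distinction concrete, here is a minimal sketch. It writes "ñ" with decimal byte escapes (195, 177 is its UTF-8 encoding) so the result does not depend on how the source file is saved; the variable names are only for illustration.

```lua
-- "anñ" with ñ (U+00F1) spelled out as its two UTF-8 bytes.
local s = "an\195\177"

print(string.len(s))            --> 4 (bytes), although there are only 3 characters
print(string.byte(s, 1, -1))    --> 97  110  195  177

-- Cutting off the last *byte* splits the character and leaves a lone lead byte.
local broken = string.sub(s, 1, string.len(s) - 1)
print(string.len(broken))       --> 3
print(string.byte(broken, -1))  --> 195
```

This is why a plain string.sub(s, 1, -2) is not a safe way to drop the last character of a UTF-8 string.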
Some UTF-8 basics:
If you need some refreshing on the conceptual basics of Unicode, you should check out this article.
UTF-8 is one possible (and very important) implementation of Unicode - and probably the one you are dealing with. As opposed to UTF-32 and UTF-16 it uses a variable number of bytes (from 1 to 4) to encode each character. In particular, the ASCII characters 0 to 127 are represented with a single byte, so that ASCII strings can be correctly interpreted using UTF-8 (and vice versa, if you only use those 128 characters). All other characters start with a byte in the range from 194 to 244 (which signals that more bytes follow to encode a full character). This range is further subdivided, so that you can tell from this byte, whether 1, 2 or 3 more bytes follow. Those additional bytes are called continuation bytes and are guaranteed to be only taken from the range from 128 to 191. Therefore, by looking at a single byte we know where it stands in a character:
- [0,127]: it's a single-byte (ASCII) character
- [128,191]: it's part of a longer character and meaningless on its own
- [194,244]: it marks the beginning of a longer character (and tells us how long that character is)

This information is enough to count characters, split a UTF-8 string into characters and do all sorts of other UTF-8-sensitive manipulations.
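As an illustration of how far these byte ranges get you, the following sketch classifies a byte and counts characters by skipping continuation bytes. The function names byteRole and countChars are made up for this example, and the code assumes its input is already valid UTF-8 (it does no validation).

```lua
-- Classify one byte value (0..255) by its role in UTF-8.
function byteRole(b)
    if b <= 127 then
        return "single"        -- complete one-byte (ASCII) character
    elseif b <= 191 then
        return "continuation"  -- belongs to a character that started earlier
    elseif b >= 194 and b <= 244 then
        return "lead"          -- first byte of a 2- to 4-byte character
    else
        return "invalid"       -- 192, 193 and 245..255 never occur in valid UTF-8
    end
end

-- Count characters by counting every byte that is not a continuation byte.
function countChars(s)
    local n = 0
    for i = 1, #s do
        if byteRole(string.byte(s, i)) ~= "continuation" then
            n = n + 1
        end
    end
    return n
end

print(countChars("an\195\177"))  --> 3
```

Every character contributes exactly one non-continuation byte, so counting those bytes counts the characters.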
Some pattern matching basics:
For the task at hand we need a few of Lua's pattern matching constructs:
- [...] is a character class that matches a single character (or rather byte) from those inside the class. E.g. [abc] matches either a, b or c. You can define ranges using a hyphen. Therefore [\33-\127], for example, matches any single one of the bytes from 33 to 127. Note that \127 is an escape sequence you can use in any Lua string (not just patterns) to specify a byte by its numerical value instead of the corresponding ASCII character. For instance, "a" is the same as "\97". You can negate a character class by starting it with ^, so that it matches any single byte that is not part of the class.
- * repeats the previous token 0 or more times (arbitrarily many times, as often as possible).
- $ is an anchor. If it's the last character of the pattern, the pattern will only match at the end of the string.
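If you want to see each of these constructs on its own, here is a small sketch using plain ASCII input; the sample strings are arbitrary.

```lua
-- A decimal escape names a byte by value: "\97" is the same string as "a".
print("\97" == "a")                       --> true

-- A character class matches exactly one byte from the set.
print(string.match("xbz", "[abc]"))       --> b

-- A leading ^ negates the class; * then repeats it as often as possible.
print(string.match("abcXYZ", "[^X-Z]*"))  --> abc

-- $ at the end of the pattern anchors the match to the end of the string.
print(string.match("abcXYZ", "[X-Z]*$"))  --> XYZ
```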
Combining all of that...
...your problem reduces to a one-liner:
local function lastChar(s)
    return string.match(s, "[^\128-\191][\128-\191]*$")
end
This will match a byte that is not a UTF-8 continuation byte (i.e., one that is either a single-byte character or a byte that marks the beginning of a longer character). Then it matches an arbitrary number of continuation bytes (this cannot go past the current character, due to the range chosen), followed by the end of the string ($). Therefore, this will give you all the bytes that make up the last character in the string. It produces the desired output for all 4 of your examples.
Equivalently, you can use gsub to remove that last character from your string:
function deleteLastCharacter(s)
    -- gsub also returns the number of substitutions; the parentheses keep only the string
    return (string.gsub(s, "[^\128-\191][\128-\191]*$", ""))
end
The match is the same, but instead of returning the matched substring, we replace it with "" (i.e. remove it) and return the modified string.
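As a quick check against the four strings from the question, here is a self-contained sketch. It restates both functions so that it runs on its own; the name LAST_CHAR is introduced only for this example, and the parentheses around string.gsub discard its second return value (the substitution count). It assumes the file is saved as UTF-8, so the string literals are UTF-8 bytes.

```lua
-- One non-continuation byte, then any continuation bytes, anchored at the end.
local LAST_CHAR = "[^\128-\191][\128-\191]*$"

function lastChar(s)
    return string.match(s, LAST_CHAR)       -- nil for the empty string
end

function deleteLastCharacter(s)
    return (string.gsub(s, LAST_CHAR, ""))  -- parentheses keep only the string
end

-- Print the last character and the shortened string for each example.
for _, s in ipairs({ "hello there", "anñ", "כראע", "ㅎㄹㅇㅇㅅ" }) do
    print(lastChar(s), deleteLastCharacter(s))
end
```

Note that for an empty string lastChar returns nil and deleteLastCharacter returns the empty string unchanged, so callers may want to handle that case explicitly.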