
Matching Unicode punctuation using LPeg


I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:

local unicode     = require("unicode")
local lpeg        = require("lpeg")
local any         = lpeg.P(1)  -- any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i+#match
  end
end)

This appears to work, but I see three problems. First, because I read at most 4 bytes ahead, it will miss punctuation characters that are a combination of several Unicode codepoints (if such characters exist). Second, it probably kills the performance of the parser. Third, it is undefined what the library match function will do when I feed it a string that contains a runt (truncated) UTF-8 character, although it appears to work now.

I would like to know whether this is a correct approach or whether there is a better way to do this.


Solution

  • The correct way to match UTF-8 characters is shown in an example on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes belong to it:

    local cont = lpeg.R("\128\191") -- continuation byte
    
    local utf8 = lpeg.R("\0\127")
               + lpeg.R("\194\223") * cont
               + lpeg.R("\224\239") * cont * cont
               + lpeg.R("\240\244") * cont * cont * cont
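
    As a quick check of how this pattern behaves, the following sketch splits a string into its UTF-8 characters and shows that a truncated sequence simply fails to match, so a callback placed behind the pattern should never be handed a runt character. The names chars and t are illustrative only and the input strings are made up for the example:

    ```lua
    local lpeg = require("lpeg")

    local cont = lpeg.R("\128\191") -- continuation byte

    local utf8 = lpeg.R("\0\127")
               + lpeg.R("\194\223") * cont
               + lpeg.R("\224\239") * cont * cont
               + lpeg.R("\240\244") * cont * cont * cont

    -- Collect every character of a well-formed UTF-8 string into a table;
    -- lpeg.P(-1) only succeeds at the end of the subject.
    local chars = lpeg.Ct(lpeg.C(utf8)^0) * lpeg.P(-1)

    -- "a", U+00BF (2 bytes) and U+2026 (3 bytes), written as byte escapes
    local t = chars:match("a\194\191\226\128\166")
    print(#t)                       -- 3

    -- A 3-byte lead byte followed by only one continuation byte does not match
    print(utf8:match("\226\128"))   -- nil
    ```

    Because the length is decided by the lead byte, the pattern consumes exactly one character and never reads further ahead than it has to.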
    

    Building on this utf8 pattern, we can use lpeg.Cmt and the Selene Unicode match function, much as you proposed:

    local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
        if unicode.utf8.match(c, "%p") then
            return i
        end
    end)
    

    Note that we return i, in accordance with what Cmt expects:

    The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.

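    This means we should return the same number the function receives, that is, the position immediately after the UTF-8 character. Adding the length of the match to i, as the code in the question does, would skip past input that has not been examined.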
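
    To see the position handling in isolation, here is a minimal sketch that swaps the Selene Unicode call for a small hand-picked lookup table, so that it runs with LPeg alone. The table is_punct and the variable positions are hypothetical stand-ins for illustration and are not part of any library; the table covers only three characters and is not a real punctuation test:

    ```lua
    local lpeg = require("lpeg")

    local cont = lpeg.R("\128\191") -- continuation byte
    local utf8 = lpeg.R("\0\127")
               + lpeg.R("\194\223") * cont
               + lpeg.R("\224\239") * cont * cont
               + lpeg.R("\240\244") * cont * cont * cont

    -- Assumption: stand-in for unicode.utf8.match(c, "%p")
    local is_punct = {
      ["!"]            = true,
      ["\194\191"]     = true, -- U+00BF inverted question mark
      ["\226\128\166"] = true, -- U+2026 horizontal ellipsis
    }

    local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
      if is_punct[c] then
        return i -- i already points just past the matched character
      end
      -- returning nothing makes the match fail
    end)

    print(punctuation:match("\226\128\166x")) -- 4, just past the 3-byte ellipsis
    print(punctuation:match("x"))             -- nil

    -- Byte positions at which punctuation starts in "a!\194\191b"
    local positions = lpeg.Ct((lpeg.Cp() * punctuation + utf8)^0)
    print(table.concat(positions:match("a!\194\191b"), ",")) -- 2,3
    ```

    With the real library installed, replacing is_punct[c] by unicode.utf8.match(c, "%p") should give the pattern shown above.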