Tags: regex, string, lua, lua-patterns

Getting string characters inside and outside a set of brackets with string patterns in Lua?


I'm trying to create a string pattern that will match both runs of non-space characters and everything inside a set of brackets. For example, a sequence such as this:

local str = [[
    This [pattern] should [return both] non-space 
    characters and [everything inside] brackets
]]

This should print This, [pattern], should, [return both], non-space, ... etc. I've been at this for a while and came up with an attempt that is very close. I know what is wrong with it, but I can't seem to fix it. Here's my attempt:

local str = [[
    This [pattern] should [return both] non-space 
    characters and [everything inside] brackets
]]

for s in string.gmatch(str, "%S+%[?.-%]?") do
    print(s)
end

The issue is that spaces should be allowed inside the brackets, but not outside. Instead, this prints something like: This, [pattern], should, [return, both], non-space, ... etc.

Notice that [return and both] are two separate matches, as opposed to a single [return both]. I'm still fairly new to string patterns, so I feel like there are a few options I could be overlooking. Anyway, if anyone is experienced with this sort of thing, I sure would appreciate some insight.


Solution

  • To explain Egor's solution from the comments a bit: the key idea is to distinguish whitespace inside the brackets [] from whitespace outside them. This is accomplished by

    • first using gsub to replace the whitespace outside the brackets with \0,
    • then running gmatch over the result, matching runs of non-null characters.

    The null char \0 is used as a sentinel since it's unlikely to clash with a legitimate character in the input text.
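
    The two steps above can be sketched roughly as follows. This is not Egor's exact code, just one way to realize the idea; the helper name split_tokens is made up for illustration, and the sketch assumes brackets are balanced and not nested.

```lua
-- Hypothetical helper (not from the original comment): split `str` into
-- tokens, keeping each bracketed group, spaces included, as one token.
-- Assumes brackets are balanced and never nested.
local function split_tokens(str)
  -- Walk the string as "outside text" followed by an optional bracket group.
  -- Only the outside text has its whitespace runs replaced by the \0 sentinel.
  local marked = str:gsub("([^%[]*)(%[?[^%]]*%]?)", function(outside, inside)
    return (outside:gsub("%s+", "\0")) .. inside
  end)

  -- %Z matches any character except \0, so each run of %Z is one token.
  local tokens = {}
  for s in marked:gmatch("%Z+") do
    tokens[#tokens + 1] = s
  end
  return tokens
end

local str = [[
    This [pattern] should [return both] non-space
    characters and [everything inside] brackets
]]

for _, s in ipairs(split_tokens(str)) do
  print(s)
end
```

    Because the whitespace inside the brackets is never touched, the tokens need no clean-up pass afterwards.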

    A variation on this approach is to replace the whitespace inside the brackets instead, and then match runs of non-whitespace characters:

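    for s in str:gsub("(%[.-%])",
                      function(x)
                        -- hide whitespace inside the brackets behind a \0 sentinel
                        return x:gsub("%s+", "\0")
                      end)
                :gmatch "%S+"
    do
      -- turn the sentinel back into a space before printing
      print( (s:gsub("%z+", " ")) )
    end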
    

    Note that you are creating intermediate strings during the parse. If the input string is long, then so is the temporary intermediate string. For one-off matches this is probably okay. If you're dealing with more heavy-duty parsing, I suggest checking out LPeg.

    For example, the following lpeg.re grammar can parse the input text (slightly modified here so that the last bracket group spans three words):

    local re = require 're'
    
    local str =
    [[
        This [pattern] should [return both] non-space 
        characters and [everything inside brackets]
    ]]
    
    local pat = re.compile
      [[
        match_all   <- %s* match_piece+ !.
        match_piece <- {word / bracket_word} %s*
        word        <- ([^]%s[])+
        bracket_word <- '[' (word %s*)+ ']'
      ]]
    
    for _, each in ipairs{ pat:match(str) } do
      print(each)
    end
    

    Outputs:

    This
    [pattern]
    should
    [return both]
    non-space
    characters
    and
    [everything inside brackets]