
Lua split string considering nil entry


str = "cat,dog,,horse"
for word in string.gmatch(str, "([^,'',%s]+)") do
    print(word)
end

This code outputs the following.

cat
dog
horse

I want the empty entry to be treated as nil as well, giving the following output.

cat
dog
nil
horse

How can this be done?


Solution

  • A few things:

    • nil ~= "". You probably want the empty string rather than nil here. It is however trivial to convert one into the other, so I'll be using the empty string in the following code.
    • You don't need the parentheses around the gmatch pattern. If there are no "captures" (parentheses), the entire pattern is implicitly captured.
    • I'm rather confused about the intent of your pattern. You're matching sequences of one or more characters that are not whitespace, a comma, or a single quote; that is, you're splitting on whitespace, commas, and single quotes. For some reason, you also have ' and , twice in the character class; just once suffices. I'll be assuming you want to split by ,.
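
    To illustrate the first point: a generic for loop stops as soon as its iterator returns nil, so an iterator cannot yield nil for a missing entry. A minimal sketch, assuming you collect the fields into a table and convert afterwards; the helper name fields_with_nil and the explicit count n are illustrative choices, not something from the question:

```lua
-- Hypothetical helper: split on "," and store nil for empty fields.
-- The table may contain holes, so the field count is returned separately.
local function fields_with_nil(str)
  local fields, n = {}, 0
  -- Appending a delimiter makes every field, including empty ones, end with ","
  for field in (str .. ","):gmatch("(.-),") do
    n = n + 1
    if field ~= "" then
      fields[n] = field -- empty fields are simply left as nil
    end
  end
  return fields, n
end

local fields, n = fields_with_nil("cat,dog,,horse")
for i = 1, n do
  print(fields[i]) -- prints cat, dog, nil, horse
end
```

    Iterating with an explicit count instead of ipairs or # matters here, since both may stop at the first nil.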

    The issue is that currently your pattern uses the + (one or more) quantifier when you want * (zero or more). Just using * works completely fine on Lua 5.4:

    Lua 5.4.4  Copyright (C) 1994-2022 Lua.org, PUC-Rio
    > local str = "cat,dog,,horse"; for word in str:gmatch"[^,]*" do print(word) end
    cat
    dog
    
    horse
    

    However, that same code misbehaves on LuaJIT, which follows the Lua 5.1 matching rules: besides the empty string for two consecutive delimiters, it yields an extra empty string after every non-empty match, because matching resumes at the delimiter and * matches zero characters there (this could be seen as "technically correct" since the empty string is a match for *, but I see it as a violation of the greediness of *). One solution is to append a delimiter to the string and require each match to end with a delimiter, capturing everything before it:

    LuaJIT 2.1.0-beta3 -- Copyright (C) 2005-2017 Mike Pall. http://luajit.org/
    JIT: ON SSE2 SSE3 SSE4.1 AMD BMI2 fold cse dce fwd dse narrow loop abc sink fuse
    > local str = "cat,dog,,horse"; for word in (str .. ","):gmatch("(.-),") do print(word) end
    cat
    dog
    
    horse
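
    The same idea can be wrapped in a reusable function. This is only a sketch: the name split is hypothetical, and escaping every punctuation character with % is an assumption made here so that delimiters such as . or - are matched literally rather than as pattern items:

```lua
-- Hypothetical helper built on the "append a delimiter" trick.
local function split(str, sep)
  assert(sep ~= "", "separator must not be empty")
  -- Escape punctuation so the separator is matched literally (assumption:
  -- the caller passes a plain string, not a pattern).
  local escaped = sep:gsub("%p", "%%%0")
  local fields = {}
  -- Every field, including an empty one, is now followed by the separator.
  for field in (str .. sep):gmatch("(.-)" .. escaped) do
    fields[#fields + 1] = field
  end
  return fields
end

local words = split("cat,dog,,horse", ",")
print(#words)           -- 4
print(words[3] == "")   -- true
```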
    

    A third option would be to split manually using repeated calls to string.find. Here's the utility I wrote myself for that:

    function spliterator(str, delim, plain)
        assert(delim ~= "")
        local last_delim_end = 0
    
        -- Iterator of possibly empty substrings between two matches of the delimiter
        -- To exclude empty strings, filter the iterator or use `:gmatch"[...]+"` instead
        return function()
            if not last_delim_end then
                return
            end
    
            local delim_start, delim_end = str:find(delim, last_delim_end + 1, plain)
            local substr
            if delim_start then
                substr = str:sub(last_delim_end + 1, delim_start - 1)
            else
                substr = str:sub(last_delim_end + 1)
            end
            last_delim_end = delim_end
            return substr
        end
    end
    

    The usage in this example would be

    for word in spliterator("cat,dog,,horse", ",") do print(word) end
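
    The third parameter, plain, is passed straight through to string.find, where a true value turns off pattern matching. That matters for delimiters containing magic characters such as a dot. The following self-contained sketch runs the same find-based loop but collects the pieces into a table; split_plain is a hypothetical name, not the utility above:

```lua
-- Hypothetical table-returning variant that always searches in plain mode.
local function split_plain(str, delim)
  assert(delim ~= "", "delimiter must not be empty")
  local fields, pos = {}, 1
  while true do
    -- The fourth argument `true` makes find treat delim as literal text.
    local delim_start, delim_end = str:find(delim, pos, true)
    if not delim_start then
      -- No further delimiter: the rest of the string is the last field.
      fields[#fields + 1] = str:sub(pos)
      return fields
    end
    fields[#fields + 1] = str:sub(pos, delim_start - 1)
    pos = delim_end + 1
  end
end

local parts = split_plain("1.2..3", ".")
print(table.concat(parts, "|")) -- 1|2||3
```

    Without plain mode, "." would be interpreted as "any character" and every character of the input would count as a delimiter.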
    

    Whether you want to add this to the string table, keep it in a local variable or perhaps a required string util module is up to you.