
Lua Multi-Line comment remover


I'm trying to remove all normal and multi-line comments from a string, but my attempt doesn't remove multi-line comments entirely. I tried

str:gsub("%-%-[^\n\r]+", "")

on this code

print(1)
--a
print(2) --b
--[[
    print(4)
]]

output:

print(1)

print(2) 

    print(4)
]]

expected output:

print(1)

print(2)

Solution

  • The pattern you have provided to gsub, %-%-[^\n\r]+, will only remove "short" comments ("line" comments). It doesn't even attempt to deal with "long" comments and thus just treats their first line as a line comment, removing it.

    Thus Piglet is right: you must remove the line comments after removing the long comments, not the other way around, so as not to lose the start of long comments.


    The pattern suggested by Piglet however necessarily fails for some (carefully crafted) long comments or even line comments. Consider

    --[this is a line comment]print"Hello World!"
    

    Piglet's pattern would strip the balanced square brackets, treating the comment as if it were a long comment and uncommenting the rest of the line! We obtain:

    print"Hello World!"
    

    In a similar vein, it may happily treat a second line comment as the end of a long comment, removing all the code in between:

    --[
    -- all my code goes here
    print"Hello World!"
    -- end of all my code
    --]
    

    would be turned into the empty string.

    Furthermore, long comments may use multiple equal signs (=) and must be terminated by the same sequence of equal signs (which is not equivalent to matching square ([]) brackets):

    --[=[
    A long long comment
    ]] <- not the termination of this long long comment
    (poor regular-grammar-based syntax highlighters fail this)
    ]=]
    

    Here the pattern would terminate the comment at ]], leaving behind text that causes syntax errors:

     <- not the termination of this long long comment
    (poor regular-grammar-based syntax highlighters fail this)
    ]=]
    

    Considering that Lua 5.1 already deprecates nesting long comments (whereas LuaJIT will entirely reject it), there is no need for matching balanced brackets here. Rather, you need to find long comment start sequences and then terminate at the next stop sequence with the same number of equals signs. Here's some hacky pattern-based code to do just this:

    for equal_signs in str:gmatch"%-%-%[(=*)%[" do
        str = str:gsub("%-%-%["..equal_signs.."%[(.-)%]"..equal_signs.."%]", "", 1)
    end
    

    and here's an example string str for it to process, enclosed in a long string literal for easier testing:

    local str = [==[
    --[[a "long" comment]]
    print"hello world"
    --[=[another long comment
    --[[this does not disrupt it at all
    ]=]
    --]] oops, just a line comment
    --[doesn't care about line comments]
    ]==]
    

    which yields:

    
    print"hello world"
    
    --]] oops, just a line comment
    --[doesn't care about line comments]
    
    

    retaining the newlines.

    Now why is this hacky, despite fixing all of the aforementioned issues? Well, it's inefficient: each time gmatch encounters a long comment opener, gsub scans the source from the start to replace one long comment with that number of equals signs. For n long comments this means quadratic complexity, O(n²).

    You can't trivially optimize this by replacing all long comments of a given length in one gsub call and skipping lengths you have already handled. That would reduce the complexity to O(n sqrt n), since a source of length n can contain only on the order of sqrt(n) different long comment lengths, but it would be incorrect: the gsub is limited to one replacement so as not to remove part of a long comment with more equals signs:

    --[=[another long comment
    --[[this does not disrupt it at all
    ]=]
    

    You could, however, optimize it by using string.find repeatedly to find (1) the next opening delimiter and (2) the matching closing delimiter, adding all the substrings in between to a rope (a table of pieces) that is concatenated into a string at the end. Assuming linear matching performance (which isn't guaranteed, but could be the case for simple patterns such as this one given a better implementation than the current one), this would run in linear time. Implementing this is left as an exercise to the reader, as pattern-based approaches are overall infeasible.
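
    For reference, here is a minimal sketch of that exercise. The function name strip_long_comments is my own invention, and the sketch shares the remaining weaknesses of the pattern-based approach: it knows nothing about string literals and will treat a --[[ inside a line comment as a long comment opener.

```lua
-- Hypothetical helper: removes long comments in one left-to-right pass.
local function strip_long_comments(src)
    local parts = {}  -- the "rope": pieces of source that are kept
    local pos = 1
    while true do
        -- (1) Find the next opening delimiter: --[[, --[=[, --[==[, ...
        local open_start, open_end, equal_signs = src:find("%-%-%[(=*)%[", pos)
        if not open_start then break end
        -- (2) Find the closing delimiter with the same number of "=".
        -- The fourth argument (true) requests a plain, non-pattern search.
        local _, close_end = src:find("]" .. equal_signs .. "]", open_end + 1, true)
        if not close_end then break end  -- unterminated: keep the rest as-is
        parts[#parts + 1] = src:sub(pos, open_start - 1)
        pos = close_end + 1
    end
    parts[#parts + 1] = src:sub(pos)
    return table.concat(parts)
end
```

    Because the search resumes after each closing delimiter, an opener such as --[[ inside an already removed --[=[ ... ]=] comment is never looked at, so no per-match gsub over the whole source is needed.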


    Note also that removing comments (to minify code?) may introduce syntax errors, as at the tokenization stage, comment (or whitespace) tokens (which are later suppressed) might be used to separate other tokens. Consider the following pathological case:

    do--[[]]print("hello world")end
    

    which would be turned into

    doprint("hello world")end
    

    which is an entirely different beast (call to doprint now, syntax error since the end isn't matched by an opening do anymore).
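
    This is easy to reproduce. The sketch below handles only long comments without equals signs, for brevity, and uses a hypothetical compile alias so that it runs on both Lua 5.1/LuaJIT (loadstring) and later versions (load). It compares replacing the comment with nothing against replacing it with a single space:

```lua
-- Lua 5.1 and LuaJIT compile strings with loadstring; Lua 5.2+ use load.
local compile = loadstring or load

local src = 'do--[[]]print("hello world")end'
local pattern = "%-%-%[%[.-%]%]"  -- long comments without equals signs only

local glued = src:gsub(pattern, "")    -- 'doprint("hello world")end'
local spaced = src:gsub(pattern, " ")  -- 'do print("hello world")end'

print(compile(glued) ~= nil)   -- false: no longer valid Lua
print(compile(spaced) ~= nil)  -- true: still compiles
```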


    In addition, any pattern-based solution is likely to fail to consider context, removing "comments" inside string literals or, even harder to work around, long string literals. Again, workarounds might be possible (e.g. replacing strings with placeholders and later substituting them back), but this gets messy and error-prone. Consider

    quoted_string = "--[[this is no comment but rather part of the string]]"
    long_string = [=[--[[this is no comment but rather part of the string]]]=]
    

    Both literals would be turned into empty strings by comment removal patterns.


    Conclusion

    1. Pattern-based solutions are bound to fall short on a myriad of edge cases. They will also usually be inefficient.
    2. At least a partial tokenization that distinguishes between comments and "everything else" is needed. This must take care of long strings & long comments properly, counting the number of equals signs. Using a handwritten tokenizer is possible, but I'd recommend using lhf's ltokenp.
    3. Even when using a proper tokenization stage to strip long comments, you might still have the aforementioned tokenization issue. For that reason you'll have to insert whitespace in place of the comment (if there isn't any already). To save the most space you could check whether removing the comment alters the tokenization (e.g. removing the comment in if--[[comment]]"str"then end is fine, since the string will still be considered a distinct token from the keyword if).
    4. What's your root problem here? If you're searching for a Lua minifier, just grab a battle-tested one rather than trying to roll your own (and especially before you try to rename local variables using patterns!).
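
    To make points 2 and 3 concrete, here is a minimal sketch of such a partial tokenizer. It is an illustration under assumptions, not a replacement for ltokenp or a real minifier: strip_comments and its inner helpers are hypothetical names, unterminated strings and comments are simply treated as running to the end of the input, and every long comment is replaced by a single space rather than checking whether the space is actually needed.

```lua
-- Hypothetical helper: distinguishes comments, strings and "everything else".
local function strip_comments(src)
    local out, pos, len = {}, 1, #src
    local function emit(s) out[#out + 1] = s end

    -- If a long bracket ("[[", "[=[", ...) opens at index i, return the index
    -- where its matching close ends (or len if unterminated); otherwise nil.
    local function long_bracket_end(i)
        local _, open_end, eq = src:find("^%[(=*)%[", i)
        if not open_end then return nil end
        local _, close_end = src:find("]" .. eq .. "]", open_end + 1, true)
        return close_end or len
    end

    while pos <= len do
        local c = src:sub(pos, pos)
        local long_end = c == "[" and long_bracket_end(pos)
        if c == '"' or c == "'" then
            -- Quoted string: copy verbatim, skipping backslash escapes.
            local i = pos + 1
            while i <= len do
                local d = src:sub(i, i)
                if d == "\\" then i = i + 2
                elseif d == c then break
                else i = i + 1 end
            end
            emit(src:sub(pos, i))
            pos = i + 1
        elseif long_end then
            -- Long string literal: copy verbatim.
            emit(src:sub(pos, long_end))
            pos = long_end + 1
        elseif src:sub(pos, pos + 1) == "--" then
            local comment_end = long_bracket_end(pos + 2)
            if comment_end then
                emit(" ")  -- long comment: leave a separator behind
                pos = comment_end + 1
            else
                -- Line comment: drop it but keep the newline that ends it.
                pos = src:find("[\r\n]", pos) or len + 1
            end
        else
            -- Ordinary code: copy up to the next character of interest.
            local stop = src:find("[\"'%[%-]", pos + 1) or len + 1
            emit(src:sub(pos, stop - 1))
            pos = stop
        end
    end
    return table.concat(out)
end

print(strip_comments('do--[[]]print("hello world")end'))
--> do print("hello world")end
```

    The quoted_string and long_string examples from above pass through this sketch unchanged, because string literals are recognized before comments are looked for.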