I'm trying to remove all normal and multi-line comments from a string, but my pattern doesn't remove the entire multi-line comment. I tried
str:gsub("%-%-[^\n\r]+", "")
on this code
print(1)
--a
print(2) --b
--[[
print(4)
]]
output:
print(1)
print(2)
print(4)
]]
expected output:
print(1)
print(2)
The pattern you have provided to gsub, %-%-[^\n\r]+, will only remove "short" comments ("line" comments). It doesn't even attempt to deal with "long" comments and thus just treats their first line as a line comment, removing it.
Thus Piglet is right: you must remove the line comments after removing the long comments, not the other way around, so as not to lose the start of long comments.
The pattern suggested by Piglet however necessarily fails for some (carefully crafted) long comments or even line comments. Consider
--[this is a line comment]print"Hello World!"
Piglet's pattern would strip the balanced brackets, treating the comment as if it were a long comment and uncommenting the rest of the line! We obtain:
print"Hello World!"
In a similar vein, it may happily consider a second line comment part of a long comment, commenting out your entire code:
--[
-- all my code goes here
print"Hello World!"
-- end of all my code
--]
would be turned into the empty string.
Furthermore, long comments may use multiple equal signs (=) and must be terminated by the same sequence of equal signs (which is not equivalent to matching square brackets ([])):
--[=[
A long long comment
]] <- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
Piglet's pattern would terminate the comment at ]], leaving some syntax errors:
<- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
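The failures above can be reproduced with a short sketch. Piglet's exact pattern isn't quoted here, so this assumes a stand-in built on %b[] (balanced brackets); strip_balanced is a hypothetical helper name used only for illustration.

```lua
-- Assumption: a long-comment pattern built on %b[] (balanced brackets).
-- This is a stand-in, not necessarily the exact pattern from the other answer.
local function strip_balanced(src)
  -- Parentheses drop gsub's second return value (the replacement count).
  return (src:gsub("%-%-%b[]", ""))
end

-- A line comment that merely starts with a bracket is mistaken for a long comment.
local uncommented = strip_balanced('--[this is a line comment]print"Hello World!"')

-- Two unrelated line comments are treated as one big "long comment".
local swallowed = strip_balanced('--[\n-- all my code goes here\nprint"Hello World!"\n--]')

-- A level-1 long comment is cut off at the first "]]" instead of at "]=]".
local cut_short = strip_balanced("--[=[\nA long long comment\n]] <- not the termination\n]=]")

print(uncommented)
print(#swallowed)
print(cut_short)
```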
Considering that Lua 5.1 already deprecates nesting long comments (whereas LuaJIT rejects it entirely), there is no need to match balanced brackets here. Rather, you need to find long comment start sequences and then terminate at the next matching stop sequence. Here's some hacky pattern-based code to do just this:
for equal_signs in str:gmatch"%-%-%[(=*)%[" do
str = str:gsub("%-%-%["..equal_signs.."%[(.-)%]"..equal_signs.."%]", "", 1)
end
And here's an example string str for it to process, enclosed in a long string literal for easier testing:
local str = [==[
--[[a "long" comment]]
print"hello world"
--[=[another long comment
--[[this does not disrupt it at all
]=]
--]] oops, just a line comment
--[doesn't care about line comments]
]==]
which yields:
print"hello world"
--]] oops, just a line comment
--[doesn't care about line comments]
The newlines around the removed comments are retained.
Now why is this hacky, despite fixing all of the aforementioned issues? Well, it's inefficient: each time the loop encounters a long comment opener, gsub scans the source again from the start to replace one long comment of that level. For n long comments this means quadratic complexity, O(n²).
You can't trivially optimize this by skipping long comments whose level (number of equal signs) has already been replaced everywhere, which would reduce the complexity to O(n sqrt n), since a source of length n can contain at most on the order of sqrt(n) different long comment levels. The reason is that the gsub has to be limited to one replacement so as not to remove part of a long comment with more equal signs:
--[=[another long comment
--[[this does not disrupt it at all
]=]
You could however optimize it by using string.find repeatedly to always find (1) the opening delimiter and (2) then the closing delimiter, adding all the substrings in between to a rope that is finally concatenated into a string. Assuming linear matching performance (which isn't the case but could be, given a better implementation than the current one, for simple patterns such as this one), this would run in linear time. Implementing this is left as an exercise to the reader, as pattern-based approaches are overall infeasible.
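For reference, a minimal sketch of such a find-based loop might look as follows. The function name strip_long_comments and the choice to leave an unterminated comment untouched are my assumptions. It shares the other limitations of the pattern version: it isn't aware of string literals, and a long comment opener inside a line comment is still treated as an opener.

```lua
-- Sketch: strip long comments in a single left-to-right pass.
-- Not string-aware: "--[[" inside a string literal is still treated as a comment.
local function strip_long_comments(src)
  local pieces = {} -- the "rope": chunks of source that are kept
  local pos = 1
  while true do
    -- (1) Find the next opening delimiter: --[[, --[=[, --[==[, ...
    local open_start, open_end, equal_signs = src:find("%-%-%[(=*)%[", pos)
    if not open_start then break end
    -- (2) Find the closing delimiter of the same level (plain find, no pattern).
    local _, close_end = src:find("]" .. equal_signs .. "]", open_end + 1, true)
    if not close_end then break end -- unterminated: keep the rest as it is
    pieces[#pieces + 1] = src:sub(pos, open_start - 1)
    pos = close_end + 1
  end
  pieces[#pieces + 1] = src:sub(pos) -- whatever follows the last comment
  return table.concat(pieces)
end

print(strip_long_comments('print"a" --[=[ long ]] still long ]=] print"b"'))
```

Each character is visited by find a bounded number of times, so the loop itself no longer rescans the source once per comment.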
Note also that removing comments (to minify code?) may introduce syntax errors, as at the tokenization stage, comment (or whitespace) tokens (which are later suppressed) might be used to separate other tokens. Consider the following pathological case:
do--[[]]print("hello world")end
which would be turned into
doprint("hello world")end
which is an entirely different beast (a call to doprint now, and a syntax error since the end isn't matched by an opening do anymore).
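The effect can be checked by compiling both variants. The sketch below also tries one possible mitigation that is my own suggestion rather than part of the approach above: replacing each comment with a single space instead of the empty string. The name compile is just a local alias.

```lua
-- loadstring exists in Lua 5.1 and LuaJIT; load accepts strings from Lua 5.2 on.
local compile = loadstring or load

local source = 'do--[[]]print("hello world")end'

-- Replacing the comment with nothing glues `do` and `print` together.
local glued = source:gsub("%-%-%[%[.-%]%]", "")
-- Replacing it with a single space keeps the two tokens apart.
local spaced = source:gsub("%-%-%[%[.-%]%]", " ")

print(glued, compile(glued))   -- compile returns nil plus an error message
print(spaced, compile(spaced)) -- compile returns a function
```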
In addition, any pattern-based solution is likely to fail to consider context, removing "comments" inside string literals or, even harder to work around, long string literals. Again, workarounds might be possible (e.g. replacing strings with placeholders and later substituting them back), but this gets messy and error-prone. Consider
quoted_string = "--[[this is no comment but rather part of the string]]"
long_string = [=[--[[this is no comment but rather part of the string]]]=]
both of which would be reduced to empty strings by comment removal patterns.
Removing the comment from if--[[comment]]"str"then end, by contrast, is fine, since the string will still be considered a distinct token from the keyword if.
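If a full Lua parser is not an option, the context problems described above can be reduced with a small hand-written scanner instead of patterns alone. What follows is only a sketch under my own assumptions, not a complete lexer: strip_comments is a hypothetical name, long comments are replaced by a single space to keep neighbouring tokens apart, and unterminated strings or comments are simply consumed up to the end of the input.

```lua
-- Sketch of a context-aware comment stripper (hypothetical helper).
-- String literals are copied verbatim, so "--" inside them is left alone.
local function strip_comments(src)
  local out, pos = {}, 1
  -- Returns the index just past the closing long bracket of the given level.
  local function skip_long_bracket(open_end, equal_signs)
    local _, close_end = src:find("]" .. equal_signs .. "]", open_end + 1, true)
    return (close_end or #src) + 1
  end
  while pos <= #src do
    -- Jump to the next character that could start a comment or a string.
    local start = src:find("[%-\"'%[]", pos)
    if not start then break end
    out[#out + 1] = src:sub(pos, start - 1)
    local char = src:sub(start, start)
    if src:sub(start, start + 1) == "--" then
      local _, open_end, equal_signs = src:find("^%[(=*)%[", start + 2)
      if open_end then -- long comment: replace it with one space
        pos = skip_long_bracket(open_end, equal_signs)
        out[#out + 1] = " "
      else -- line comment: drop it, but keep the newline that ends it
        pos = src:find("\n", start, true) or #src + 1
      end
    elseif char == '"' or char == "'" then -- quoted string: copy verbatim
      local stop = start + 1
      while stop <= #src and src:sub(stop, stop) ~= char do
        if src:sub(stop, stop) == "\\" then stop = stop + 1 end -- skip escaped char
        stop = stop + 1
      end
      out[#out + 1] = src:sub(start, stop)
      pos = stop + 1
    else
      local _, open_end, equal_signs = src:find("^%[(=*)%[", start)
      if open_end then -- long string literal: copy verbatim
        pos = skip_long_bracket(open_end, equal_signs)
        out[#out + 1] = src:sub(start, pos - 1)
      else -- a lone "-" or "[": ordinary code
        out[#out + 1] = char
        pos = start + 1
      end
    end
  end
  out[#out + 1] = src:sub(pos)
  return table.concat(out)
end

print(strip_comments('s = "--[[kept]]" --[[removed]] t = [=[--[[kept]]]=] -- removed'))
```

Because every character is classified as code, string, or comment in one pass, the order-of-removal problem between long and line comments disappears as well.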