I recently found myself needing to generate a simple Ruby script based on user input, some of which needs to be included in the script as a string literal. While in my specific case this input comes from a trusted source, I would still like to do this in a way that will not break even if the input string happens to contain e.g. quotes, backslashes, newlines, hash marks or other unexpected metacharacters.
The obvious solution (as recommended in the accepted answer to this earlier question) would be to use the String#inspect
method, whose documentation says that it:
Returns a printable version of str, surrounded by quote marks, with special characters escaped.
The documentation, however, stops just short of explicitly stating that evaluating the output of String#inspect
as Ruby code will return the original string. And, in fact, I did technically manage to come up with a counterexample using non-Unicode strings:
pry(main)> str = 0x80.chr; eval(str.inspect) == str
=> false
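One way to see what goes wrong in this counterexample is to compare the bytes and the encoding of the two strings separately. This is only a sketch: my reading is that the bytes survive the round trip while the encoding tag may not, and the result of the last comparison can depend on the default external encoding, so no expected output is shown.

```ruby
# A single byte 0x80, which is not valid UTF-8 on its own.
str = 0x80.chr
round_tripped = eval(str.inspect)

# Compare the raw bytes and the encoding tags independently.
same_bytes = round_tripped.bytes == str.bytes
same_encoding = round_tripped.encoding == str.encoding

puts "bytes equal:     #{same_bytes}"
puts "encodings equal: #{same_encoding} (#{str.encoding} vs #{round_tripped.encoding})"
puts "strings equal:   #{round_tripped == str}"
```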
However, all the strings I need to encode are Unicode strings, so this counterexample is of only theoretical interest to me. But I'd still like some documented guarantee, hence the following questions:
Is eval(str.inspect) guaranteed to be equal to str, if str is a Unicode string?
Also, a bonus question:
Is eval("'" + str.gsub(/[\\']/, { "\\" => "\\\\", "'" => "\\'" }) + "'") always guaranteed to equal str?
Let me try to summarize the results of my investigation so far (including Max's now-deleted answer that introduced me to String#dump):
The documentation for String#inspect does not guarantee that evaling its output yields the original string. However, at least as of Ruby 3.0.2, the documentation for String#dump does make that guarantee:
This method can be used for round-trip: if the resulting new_str is eval'ed, it will produce the original string.
Thus, it appears that the answers to my questions #1 and #2 are:
No, eval(str.inspect) is not guaranteed to equal str by the Ruby documentation (although in practice it does seem to work; see below).
OTOH, eval(str.dump) is documented to always equal str.
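For the original use case (embedding user input in a generated script), relying on String#dump could look roughly like the sketch below. The helper name generate_script and the sample input are made up for illustration; only String#dump and eval come from the discussion above.

```ruby
# Hypothetical helper: builds a one-line Ruby script that prints `input`.
# The input is embedded as a double-quoted literal produced by String#dump.
def generate_script(input)
  "puts #{input.dump}\n"
end

# Sample input with quotes, a backslash, an interpolation-like sequence and CR+LF.
# %q(...) keeps all of these as literal characters.
user_input = %q(He said "hi" \ and #{left} 'quotes') + "\r\n\t"

script = generate_script(user_input)
puts script

# Evaluating the dumped literal on its own should give back the input,
# as promised by the String#dump documentation.
literal = user_input.dump
puts eval(literal) == user_input
```

The generated script contains only printable ASCII, so it should also be safe to write to a file regardless of the file's encoding settings, as long as the original string is UTF-8.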
Of course, while documentation is nice to have, it's also a good idea to make sure that the actual behavior matches what's documented.
Based on my testing, it seems that, at least on relatively modern Ruby versions, both String#inspect and String#dump produce output that, when evaled, equals the original (Unicode) string.
Specifically, using the following test string (which I believe contains all currently assigned non-surrogate Unicode characters, as well as a few extra potentially problematic character pairs and sequences),
unicode_points = (0..0xD7FF).to_a + (0xE000..0xE007F).to_a
str = unicode_points.map { |i| i.chr(Encoding::UTF_8) }.join("")
str += "\#{foo} \\\\ \\\' \\\" \r\n\t"
it seems that both eval(str.inspect) == str and eval(str.dump) == str evaluate to true on CRuby 2.6.10 and 3.3.0dev as well as on JRuby 9.3.10.0 (which are what I happen to have installed and conveniently available).
The gsub method in my bonus question #3, however, does not quite work; the problematic character sequence is "\r\n" (i.e. ASCII CR+LF), which apparently gets collapsed into a single LF even inside single-quoted strings. Specifically, it turns out that eval("'\r\n'") == "\n" (!).
(I discovered this based on a warning saying warning: encountered \r in middle of line, treated as a mere space
that I got when testing with the string containing all Unicode characters. This led me to suspect that there might be some funny parsing going on with newlines, so I added "\r\n"
to my test string and got a mismatch.)
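The difference between the two approaches can be checked side by side with a few short samples. This is a minimal sketch; the helper name single_quote_literal and the sample strings are my own choices for illustration, and the CR+LF result may vary between Ruby implementations and parser versions.

```ruby
# Hypothetical helper reproducing the single-quote escaping from the bonus question.
def single_quote_literal(str)
  "'" + str.gsub(/[\\']/, { "\\" => "\\\\", "'" => "\\'" }) + "'"
end

# A quote, a backslash, a bare LF and a CR+LF pair.
samples = ["it's", "back\\slash", "line\nbreak", "crlf\r\n"]

results = samples.map do |s|
  via_gsub = eval(single_quote_literal(s)) == s
  via_dump = eval(s.dump) == s
  puts "#{s.inspect}: gsub=#{via_gsub} dump=#{via_dump}"
  [via_gsub, via_dump]
end
```

On the versions tested above, I would expect only the CR+LF sample to fail with the gsub approach, while String#dump handles all four.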
Also, while testing String#dump, I happened to notice that the test string above fails to round-trip properly with String#undump. A simpler test case demonstrating the same issue is e.g. str = "\u0001\uABCD", for which str.dump.undump raises RuntimeError: hex escape and Unicode escape are mixed.
Apparently the problem is that String#dump encodes ASCII C0 control characters as hex escapes of the form \xNN, but non-ASCII Unicode characters (U+0080 and above) in the form \uNNNN (or \u{NNNNN} for characters outside the BMP), and String#undump for some reason does not like the mixture. While this is not a problem with eval(), which seems to happily accept the output of String#dump, it probably still counts as a bug. I have now reported it as https://bugs.ruby-lang.org/issues/19558.