
Does Ruby guarantee that eval(str.inspect) == str?


I recently found myself needing to generate a simple Ruby script based on user input, some of which needs to be included in the script as a string literal. While in my specific case this input comes from a trusted source, I would still like to do this in a way that will not break even if the input string happens to contain e.g. quotes, backslashes, newlines, hash marks or other unexpected metacharacters.

The obvious solution (as recommended in the accepted answer to this earlier question) would be to use the String#inspect method, whose documentation says that it:

Returns a printable version of str, surrounded by quote marks, with special characters escaped.

The documentation, however, stops just short of explicitly stating that evaluating the output of String#inspect as Ruby code will return the original string. And, in fact, I did technically manage to come up with a counterexample using non-Unicode strings:

pry(main)> str = 0x80.chr; eval(str.inspect) == str
=> false

However, all the strings I need to encode are Unicode strings, so this counterexample is of only theoretical interest to me. But I'd still like some documented guarantee, hence the following questions:

  1. Is eval(str.inspect) guaranteed to be equal to str, if str is a Unicode string?
  2. If not, is there some other method of escaping a string literal in generated Ruby code that is guaranteed to always work?

Also, a bonus question:

  3. Is eval("'" + str.gsub(/[\\']/, { "\\" => "\\\\", "'" => "\\'" }) + "'") always guaranteed to equal str?

Solution

  • Let me try to summarize the results of my investigation so far (including Max's now-deleted answer that introduced me to String#dump):


    The documentation for String#inspect does not guarantee that evaling its output yields the original string. However, at least as of Ruby 3.0.2, the documentation for String#dump does make that guarantee:

    This method can be used for round-trip: if the resulting new_str is eval'ed, it will produce the original string.

    Thus, it appears that the answers to my questions #1 and #2 are:

    1. No, eval(str.inspect) is not guaranteed to equal str by the Ruby documentation (although in practice it does seem to work; see below).

    2. OTOH, eval(str.dump) is documented to always equal str.
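As a rough illustration of how this could be applied to the original use case (generating a script that embeds user input as a literal), here is a minimal sketch. The helper name ruby_assignment, the variable name user_input and the sample string are made up for this example and are not part of the question.

```ruby
# Hypothetical helper, not part of the original post: emit a Ruby assignment
# whose right-hand side is `value` rendered as a literal by String#dump.
def ruby_assignment(name, value)
  "#{name} = #{value.dump}"
end

# Sample input with quotes, a backslash, interpolation syntax, CR+LF and a
# non-ASCII character.
input = "say \"hi\" \\ \#{foo} \r\n caf\u00e9"

# A tiny generated script: assign the literal, then return the variable.
script = ruby_assignment("user_input", input) + "\nuser_input\n"
puts script

# Evaluating the generated source should give back the original string.
round_tripped = eval(script)
puts round_tripped == input
```

Since String#dump escapes everything outside printable ASCII, the generated source should also be safe to write to a file regardless of the file's encoding.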


    Of course, while documentation is nice to have, it's also a good idea to make sure that the actual behavior matches what's documented.

    Based on my testing, at least on relatively modern Ruby versions, both String#inspect and String#dump empirically seem to produce output that, when evaled, equals the original (Unicode) string.

    Specifically, using the following test string (which I believe contains all currently assigned non-surrogate Unicode characters, as well as a few extra potentially problematic character pairs and sequences),

    unicode_points = (0..0xD7FF).to_a + (0xE000..0xE007F).to_a
    str = unicode_points.map { |i| i.chr(Encoding::UTF_8) }.join("")
    str += "\#{foo} \\\\ \\\' \\\" \r\n\t"
    

    it seems that both eval(str.inspect) == str and eval(str.dump) == str evaluate to true on CRuby 2.6.10 and 3.3.0dev as well as on JRuby 9.3.10.0 (which are what I happen to have installed and conveniently available).
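For anyone who wants to repeat the check on another Ruby version, it can be sketched roughly as follows. The helper round_trips? is a hypothetical name introduced here, and the sample is deliberately much smaller than the full test string above so that it runs quickly; it is an illustration, not the exact test I ran.

```ruby
# Hypothetical helper: does evaluating the literal produced by `method_name`
# (:inspect or :dump) give back a string equal to the original?
def round_trips?(str, method_name)
  eval(str.public_send(method_name)) == str
end

# Reduced sample: code points U+0000..U+07FF plus a few higher ones,
# followed by the same tricky escape sequences as in the full test string.
sample_points = (0..0x7FF).to_a + [0x20AC, 0xFFFD, 0x1F600, 0xE007F]
sample = sample_points.map { |i| i.chr(Encoding::UTF_8) }.join
sample += "\#{foo} \\\\ \\\' \\\" \r\n\t"

# Report the result for each method separately.
[:inspect, :dump].each do |m|
  puts "#{m}: #{round_trips?(sample, m)}"
end
```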


    The gsub method in my bonus question #3, however, does not quite work; the problematic character sequence is "\r\n" (i.e. ASCII CR+LF), which apparently gets collapsed into a single LF even inside single-quoted strings. Specifically, it turns out that eval("'\r\n'") == "\n" (!).

    (I discovered this thanks to a warning, "warning: encountered \r in middle of line, treated as a mere space", that I got when testing with the string containing all Unicode characters. This led me to suspect that there might be some funny parsing going on with newlines, so I added "\r\n" to my test string and got a mismatch.)
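The CR+LF mismatch can be reproduced with a short script along these lines. The helper name single_quote_literal and the sample strings are made up for illustration; the gsub expression is the one from the bonus question.

```ruby
# The quoting scheme from bonus question #3, wrapped in a hypothetical helper.
def single_quote_literal(str)
  "'" + str.gsub(/[\\']/, { "\\" => "\\\\", "'" => "\\'" }) + "'"
end

plain = "it's a \\ test"
crlf  = "line one\r\nline two"

# Quotes and backslashes are escaped, so this comparison should hold.
puts eval(single_quote_literal(plain)) == plain

# The raw CR+LF ends up inside the literal and is rewritten by the parser.
puts eval(single_quote_literal(crlf)) == crlf
puts eval(single_quote_literal(crlf)).inspect

# String#dump writes \r and \n as escapes, so no raw CR+LF reaches the parser.
puts eval(crlf.dump) == crlf
```

The difference is that the gsub approach leaves control characters in the generated source verbatim, whereas String#dump replaces them with escape sequences.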


    Also, while testing String#dump, I happened to notice that the test string above fails to round-trip properly with String#undump. A simpler test case demonstrating the same issue is e.g. str = "\u0001\uABCD", for which str.dump.undump raises RuntimeError: hex escape and Unicode escape are mixed.

    Apparently the problem is that String#dump encodes ASCII C0 control characters as hex escapes of the form \xNN, but non-ASCII characters (above U+007F) as \uNNNN (or \u{NNNNN} for characters outside the BMP), a mix that String#undump for some reason does not like. While this is not a problem with eval(), which seems to happily accept the output of String#dump, it probably still counts as a bug. I have now reported it as https://bugs.ruby-lang.org/issues/19558.
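A small script to check how a given Ruby version behaves might look like the following sketch. Since the undump behavior may differ between versions, it reports either outcome instead of assuming the error.

```ruby
str = "\u0001\uABCD"
dumped = str.dump
puts dumped   # the dumped literal, containing both a \x and a \u escape

# eval accepts the mixed escapes and restores the original string.
puts eval(dumped) == str

# undump may raise on affected Ruby versions; report either outcome.
begin
  puts dumped.undump == str
rescue RuntimeError => e
  puts "undump failed: #{e.message}"
end
```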