ruby caesar-cipher

Ruby Cyphering Leads to non Alphanumeric Characters


I'm trying to make a basic cipher.

def caesar_crypto_encode(text, shift)  
  (text.nil? or text.strip.empty? ) ? "" : text.gsub(/[a-zA-Z]/){ |cstr| 
  ((cstr.ord)+shift).chr }
end

but when the shift is too high I get these kinds of characters:

  Test.assert_equals(caesar_crypto_encode("Hello world!", 127), "eBIIL TLOIA!")

  Expected: "eBIIL TLOIA!", instead got: "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"

What is this format?


Solution

  • You get the escaped output because Ruby treats the string as UTF-8, and your conversion has produced bytes that do not form a valid UTF-8 character sequence.

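    The ASCII characters A-Z have the ordinals 65-90, and a-z have 97-122. Adding 127 pushes every letter out of the 7-bit ASCII range and into 8-bit space (192-249). In UTF-8, bytes in that range can only start a multi-byte character and must be followed by continuation bytes (128-191), so a run of them is not valid UTF-8.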

    That's why Ruby's inspect prints the string in escaped form, showing each invalid byte as a hexadecimal escape such as "\xC7".
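
    To make the arithmetic concrete, here is a small sketch of what happens to the single letter "H" when 127 is added to its ordinal. The variable names are mine, not from the question:

```ruby
# Illustrative sketch; the variable names are hypothetical.
ordinal = 'H'.ord          # 72
shifted = ordinal + 127    # 199 (0xC7), outside the 7-bit ASCII range
byte    = shifted.chr      # a one-byte binary (ASCII-8BIT) string

# Tagged as UTF-8, the lone byte 0xC7 is an incomplete character.
as_utf8 = byte.dup.force_encoding('UTF-8')

puts shifted                  # 199
puts as_utf8.valid_encoding?  # false
puts as_utf8.inspect          # "\xC7"
```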

    If you want to get some semblance of characters out of this, you can reinterpret the bytes as ISO-8859-1, a single-byte encoding in which every one of these 8-bit values is a printable character.

    Here's what you get if you do that:

    s = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
    >> s.encoding
    => #<Encoding:UTF-8>
    
    # Relabel the bytes as ISO-8859-1 (force_encoding changes the tag, not the bytes).
    # inspect still shows escapes, because the string's encoding now differs
    # from the UTF-8 that Ruby and your terminal use.
    >> s.force_encoding('iso8859-1')
    => "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
    
    # To print ISO-8859-1 text on a UTF-8 terminal, transcode it to UTF-8.
    # encode converts each ISO-8859-1 character to its UTF-8 byte sequence,
    # which your terminal (and Ruby) can display:
    >> s.encode('UTF-8')
    => "Çäëëî öîñëã!"
    
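    # Another way is to treat each byte as a Unicode code point and pack it as UTF-8.
    # This works because ISO-8859-1 byte values equal their Unicode code points: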
    >> s.bytes.pack('U*')
    => "Çäëëî öîñëã!"
    

    Of course, the proper fix is to never let the ordinals overflow into 8-bit space in the first place. Your encryption algorithm has a bug: it adds the shift without wrapping around, so you need to wrap the result back into the alphabet, which also keeps the output in the 7-bit ASCII range.
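
    One way to keep the output inside the letters is to wrap the shift around with a modulo. The sketch below assumes, based on the expected "eBIIL TLOIA!" in the question's test, that the alphabet is the 52-character sequence A-Z followed by a-z. The ALPHABET constant is my own name, not part of the question:

```ruby
# Assumed alphabet: A-Z followed by a-z (52 characters), inferred from
# the expected output in the question's test.
ALPHABET = [*'A'..'Z', *'a'..'z'].freeze

def caesar_crypto_encode(text, shift)
  return "" if text.nil? || text.strip.empty?

  text.gsub(/[a-zA-Z]/) do |char|
    # Move within the alphabet and wrap around instead of overflowing.
    ALPHABET[(ALPHABET.index(char) + shift) % ALPHABET.size]
  end
end

puts caesar_crypto_encode("Hello world!", 127)   # eBIIL TLOIA!
puts caesar_crypto_encode("eBIIL TLOIA!", -127)  # Hello world!
```

    Because Ruby's % returns a non-negative result for a positive divisor, a negative shift decodes without any extra handling.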

    A better solution

    As @tadman suggested, you can use tr instead:

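    # Build the 52-letter alphabet as an array: A-Z followed by a-z.
    AZ_SEQUENCE = [*'A'..'Z', *'a'..'z']
    
    # rotate wraps around, so a shift of 127 behaves like 127 % 52 = 23.
    "Hello world!".tr(AZ_SEQUENCE.join, AZ_SEQUENCE.rotate(127).join)
    => "eBIIL TLOIA!"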
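
    Wrapped in a method, the tr approach might look like the sketch below. Decoding is the same translation with the two arguments swapped. The names LETTERS, tr_encode and tr_decode are hypothetical:

```ruby
# Hypothetical helper names; same 52-letter alphabet as above.
LETTERS = [*'A'..'Z', *'a'..'z'].freeze

def tr_encode(text, shift)
  # Map each letter to the letter `shift` places further on, wrapping around.
  text.to_s.tr(LETTERS.join, LETTERS.rotate(shift).join)
end

def tr_decode(text, shift)
  # Swap the arguments to reverse the mapping.
  text.to_s.tr(LETTERS.rotate(shift).join, LETTERS.join)
end

encoded = tr_encode("Hello world!", 127)
puts encoded                  # eBIIL TLOIA!
puts tr_decode(encoded, 127)  # Hello world!
```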