Octal, Hex, Unicode


I have a character appearing over the wire whose hex value is \xb1 (octal \261).

This is what my header looks like:

From: "\261Central Station <sip@...>"

According to the ASCII table, the character is "±".

What I don't understand:

  1. If I try to test the same by passing "±Central Station" in the header I see it converted to "\xC2\xB1". Why?
  2. How can I have "\xB1" or "\261" appearing over the wire instead of "\xC2\xB1"?
  3. If I try to print "\xB1" or "\261" I never see "±" being printed. But if I print "\u00b1" it prints the desired character, I'm assuming because "\u00b1" is the Unicode format.

Solution

  • From the page you linked to:

    The extended ASCII codes (character code 128-255)

    There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1.

    That's worth reading twice. The character codes 128–255 aren't ASCII (ASCII is a 7-bit encoding and ends at 127).

    Assuming that you're correct that the character in question is ± (it's likely, but not guaranteed), your text could be encoded as ISO 8859-1 or, as @muistooshort kindly pointed out in the comments, any of a number of other ISO 8859-X or CP-12XX (Windows-12XX) encodings. We do know, however, that the text isn't (valid) UTF-8, because the byte 0xb1 on its own isn't a valid UTF-8 sequence.
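
To make the guessing game concrete, here is a minimal sketch that decodes the single byte 0xB1 under a few candidate encodings. The helper name `decode_byte` and the list of candidates are illustrative assumptions, not something taken from the question.

```ruby
# Hypothetical helper: treat one raw byte as text in the given encoding
# and transcode it to UTF-8 so it can be compared or printed.
def decode_byte(byte, encoding)
  [byte].pack("C").force_encoding(encoding).encode(Encoding::UTF_8)
end

# Candidate encodings are an assumption; the real one depends on the sender.
%w[ISO-8859-1 Windows-1252 ISO-8859-2].each do |name|
  puts "#{name}: #{decode_byte(0xB1, name)}"
end

puts decode_byte(0xB1, "ISO-8859-1") == "\u00b1"  # true
```

Different candidates can map the same byte to different characters, which is why the sender's declared encoding matters.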

    If you're lucky, whatever client is sending this text specified the encoding in the Content-Type header.

    As to your questions:

    1. If I try to test the same by passing ±Central Station in header I see it get converted to \xC2\xB1. Why?

    The text you're passing is in UTF-8, and the bytes that represent ± in UTF-8 are 0xC2 0xB1.
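
You can check this yourself by inspecting the bytes of the string. This is a small sketch; `hex_bytes` is a hypothetical helper name, and the character is written as an escape so the example doesn't depend on how the file is saved.

```ruby
# Hypothetical helper: list a string's bytes as two-digit hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }
end

plus_minus = "\u00b1"        # the ± character; \u literals are always UTF-8
puts plus_minus.encoding     # UTF-8
p hex_bytes(plus_minus)      # ["C2", "B1"]
```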

    2. How can I have \xB1 or \261 appearing over the wire instead of \xC2\xB1?

    We have no idea how you're testing this, so we can't answer this question. In general, though, either send the text encoded as ISO 8859-1 (Encoding::ISO_8859_1 in Ruby) or whatever encoding the original text was in, or send it as raw bytes (Encoding::ASCII_8BIT, for which Encoding::BINARY is an alias).
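
As a rough sketch of both options, assuming the text really is meant to be ISO 8859-1: `wire_bytes` is a hypothetical helper, and how you hand the result to your SIP stack is outside what this example can show.

```ruby
# Hypothetical helper: transcode text to the target encoding and return
# the result as a binary (ASCII-8BIT) string, ready to be written as-is.
def wire_bytes(text, encoding = Encoding::ISO_8859_1)
  text.encode(encoding).b
end

header_value = "\u00b1Central Station"   # UTF-8 in memory
latin1 = wire_bytes(header_value)

p latin1.bytes.first(2)   # [177, 67], i.e. 0xB1 followed by "C"
p latin1.bytesize         # 16 (the UTF-8 version is 17 bytes)

# Option two: skip transcoding and spell out the raw bytes directly.
raw = "\xB1Central Station".b
p raw == latin1           # true
```

Note that `String#encode` raises `Encoding::UndefinedConversionError` if the text contains a character the target encoding cannot represent.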

    3. If I try to print \xB1 or \261 I never see ± being printed. But if I print \u00b1 it prints the desired character. (I'm assuming because \u00b1 is the Unicode format, but I would love it if someone could explain this in detail.)

    That's not a question, but the reason is that \xB1 (\261) is not a valid UTF-8 character. Some interfaces will print for invalid characters; others will simply elide them. \u00b1, on the other hand, is a valid Unicode code point, which Ruby knows how to represent in UTF-8.
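
Ruby can tell you whether a string is well-formed in its current encoding, and `String#scrub` shows the kind of substitution a display layer might make. A minimal sketch, assuming a UTF-8 source file (Ruby's default); the helper names are hypothetical.

```ruby
# Hypothetical helpers wrapping two core String methods.
def well_formed?(str)
  str.valid_encoding?
end

def displayable(str, placeholder = "?")
  str.scrub(placeholder)   # replaces invalid byte sequences
end

lone_byte  = "\xB1Central Station"   # tagged UTF-8, but 0xB1 alone is malformed
code_point = "\u00b1Central Station" # well-formed UTF-8

puts well_formed?(lone_byte)    # false
puts well_formed?(code_point)   # true
puts displayable(lone_byte)     # ?Central Station
```

Called without an argument, `scrub` substitutes U+FFFD (the replacement character) for strings in a Unicode encoding.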

    Brief aside: UTF-8 (like UTF-16 and UTF-32) is a character encoding specified by the Unicode standard. U+00B1 is the Unicode code point for ±, and 0xC2 0xB1 are the bytes that represent that code point in UTF-8. In Ruby we can represent UTF-8 characters using either the Unicode code point (\u00b1) or the UTF-8 bytes (in hex: \xC2\xB1; or octal: \302\261, although I don't recommend the latter since fewer Rubyists are familiar with it).
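
The three notations can be compared directly. This sketch assumes the source file is UTF-8 (Ruby's default), and the constant name is made up for illustration.

```ruby
# Three ways to write the same UTF-8 string in a Ruby literal.
PLUS_MINUS_NOTATIONS = {
  code_point: "\u00b1",     # Unicode code point
  hex_bytes:  "\xC2\xB1",   # UTF-8 bytes in hex
  octal:      "\302\261"    # UTF-8 bytes in octal
}.freeze

p PLUS_MINUS_NOTATIONS.values.uniq.size   # 1, so all three are the same string
p PLUS_MINUS_NOTATIONS[:code_point].codepoints.map { |cp| format("U+%04X", cp) }  # ["U+00B1"]
```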

    Character encoding is a big topic, well beyond the scope of a Stack Overflow answer. For a good primer, read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", and for more details on how character encoding works in Ruby read Yehuda Katz's "Encodings, Unabridged". Reading both will take you less than 30 minutes and will save you hundreds of hours of pain in the future.