tcl

Confused about getting utf-8 text with special characters after decoding XOR frames?


I seem to be especially dense-headed when it comes to this topic. I have utf-8 text (JSON strings from the browser) that contains em dashes and curly apostrophes. I'd like to store it in a SQLite database such that one can search including the special characters and, of course, send it back to the browser.

I was having issues with the special characters appearing incorrectly and, at the times when I did something that resulted in them appearing correctly, the lengths would be off such that the changes made in the UI were taking place at a different location in the text data.

I assume that part of the issue is that I do not understand what I have immediately after decoding using XOR. Is it just a set of integers that needs to be converted back to a binary string using binary format ... and then read as text for an op code of 1, if the data is to appear correctly in the database?

Is using encoding convertfrom utf-8 ... after decoding and encoding convertto utf-8 ... after extracting from the database and before sending over the socket the correct method?

I thought the browser was sending utf-8 JSON to start off with; so, it seems wrong to convert to/from utf-8; but this is the only way I've been able to get the database to store the characters properly and be able to send them back to the browser without a JSON parse error or some difference in string length.

Thank you for any guidance you may be able to provide.

I'm storing decoded in the database.

set raw_decoded {}
foreach b $enc {
  append raw_decoded \
     "[expr {$b ^ [lindex $mKey [expr {[incr offset] % 4}]]}] "
}
if { $op == 1 } {
  append decoded [encoding convertfrom utf-8\
     [binary format cu* $raw_decoded]]
}

And this is how I extract that same data as response to send to the browser; $sock is configured as binary.

  set response [encoding convertto utf-8 $response]
  set len [string length $response]
  if { $len > 65535 } {
    chan puts -nonewline $sock [binary format cu2Wu {129 127} $len]
  } elseif { $len > 125 } {
    chan puts -nonewline $sock [binary format cu2Su {129 126} $len]
  } elseif { $len > 0 } {
    chan puts -nonewline $sock [binary format cu2 [list 129 $len]]
  }
  chan puts -nonewline $sock $response
  chan flush $sock
}

Solution

  • The JSON string in the browser is conceptually a sequence of Unicode characters, which need to be encoded in some fashion to a sequence of bytes to be transmitted from the browser to your backend over the websocket (I guess that is the communication channel, based on your XOR unmasking step). That encoded string (UTF-8; websocket text frames, op code 1, always carry UTF-8) is then masked and sent as a sequence of bytes. What goes into the XOR unmasking is that sequence of masked bytes; the XOR unmasks them to recover the sequence of bytes that the JSON string was encoded to. Then the encoding of the string (UTF-8 in this case) needs to be interpreted to turn those bytes back into a sequence of characters.

    That is - the browser does (assuming json contains the JSON string in the browser):

    message = xor_mask(utf8_encode(json))
    

    Then sends the bytes of message over the wire to you, so you need to invert the transform like:

    json = utf8_decode(xor_mask(message))
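
    In Tcl, the receiving side could be sketched as below. This is only an illustration under a few assumptions: unmask is a hypothetical helper name, the payload and the 4-byte mask key are assumed to have already been pulled out of the frame as binary strings, and the frame is assumed to be a complete, unfragmented text frame.

    ```tcl
    # Hypothetical helper: undo the 4-byte XOR mask on a frame payload.
    # Both arguments are binary strings (bytes), not text.
    proc unmask {payload maskKey} {
        binary scan $maskKey cu4 key      ;# four unsigned mask bytes
        binary scan $payload cu* bytes    ;# payload as a list of 0..255
        set out {}
        set i 0
        foreach b $bytes {
            # byte i is XORed with mask byte (i mod 4)
            lappend out [expr {$b ^ [lindex $key [expr {$i % 4}]]}]
            incr i
        }
        return [binary format cu* $out]   ;# back to a binary string
    }

    # Simulate the browser side: utf8_encode, then xor_mask.
    # The mask key here is an arbitrary example value.
    set text "a\u306fb"
    set maskKey [binary format cu4 {55 250 33 61}]
    set wire [unmask [encoding convertto utf-8 $text] $maskKey] ;# XOR is its own inverse

    # Server side: xor_mask again, then utf8_decode.
    set bytes [unmask $wire $maskKey]            ;# still bytes
    set json  [encoding convertfrom utf-8 $bytes] ;# now characters
    ```

    The result of unmask is still a byte sequence; only after encoding convertfrom utf-8 is it a Tcl string of characters, which is the form you want to store and index by character position.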
    

    To send strings over a socket, you will need to encode those strings in some way to represent the string characters as a sequence of bytes (which is what is conveyed to the receiver). If the -encoding of the channel you are writing to is utf-8, then the characters you write are converted to their byte sequences when you write to the socket. If the socket is binary, then the channel subsystem doesn't convert what you write, and you will need to explicitly encode the string using encoding convertto utf-8 if you want the characters in the string to be recoverable at the receiver.
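
    For the sending direction on a binary channel, a minimal sketch might look like this. textFrame is a hypothetical helper, not part of Tcl or any package, and it assumes a single unfragmented, unmasked text frame (FIN bit plus op code 1, hence 129). The point it illustrates is that the length in the header has to be the number of bytes after encoding, not the number of characters.

    ```tcl
    # Hypothetical helper: build one unmasked server-to-client text frame.
    proc textFrame {text} {
        # Characters -> bytes. From here on we are dealing with bytes only.
        set payload [encoding convertto utf-8 $text]
        set len [string length $payload]   ;# byte count, not character count
        if {$len > 65535} {
            set header [binary format cu2Wu {129 127} $len]  ;# 64-bit length
        } elseif {$len > 125} {
            set header [binary format cu2Su {129 126} $len]  ;# 16-bit length
        } else {
            set header [binary format cu2 [list 129 $len]]   ;# 7-bit length
        }
        return $header$payload
    }

    # Example: three characters, but five payload bytes.
    set frame [textFrame "a\u306fb"]
    binary scan $frame H* frameHex
    puts $frameHex   ;# prints 810561e381af62
    ```

    With $sock configured as binary, the result could then be written with chan puts -nonewline $sock [textFrame $response] followed by chan flush $sock.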

    To hopefully make the difference between a string of characters, a UTF-8 byte sequence, and the XOR masked byte sequence clearer, consider the following, for the string aはb and mask 0b11111111111111111111111111111111 (all ones, to make it easy to mask by hand):

    |   a  |   は           |   b  | character
    | 0x61 | 0x306F         | 0x62 | unicode code point
    | 0x61 | 0xE3 0x81 0xAF | 0x62 | utf-8 encoded byte sequence
    | 0x9E | 0x1C 0x7E 0x50 | 0x9D | xor_masked byte sequence
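
    The table can be reproduced in Tcl to see why lengths come out differently depending on whether you are holding characters or bytes. This is just a sketch, and the variable names are made up for illustration.

    ```tcl
    set text "a\u306fb"                            ;# 3 characters
    set utf8 [encoding convertto utf-8 $text]      ;# 5 bytes

    # XOR every byte with 0xFF (the all-ones mask from the table).
    binary scan $utf8 cu* byteList
    set maskedList {}
    foreach b $byteList {
        lappend maskedList [expr {$b ^ 0xFF}]
    }
    set masked [binary format cu* $maskedList]

    binary scan $utf8   H* utf8Hex
    binary scan $masked H* maskedHex

    puts "characters:  [string length $text]"              ;# characters:  3
    puts "utf-8 bytes: [string length $utf8] ($utf8Hex)"   ;# utf-8 bytes: 5 (61e381af62)
    puts "masked:      $maskedHex"                         ;# masked:      9e1c7e509d
    ```

    If the unmasked bytes were stored without encoding convertfrom utf-8, every position after は would be off by two, which would be consistent with the edits landing at the wrong location in the text.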