
Dealing with unusual responses to text messages


I've written an appointment scheduling system which (among other things) sends out a reminder SMS the day before an appointment is due. It asks the user to confirm their attendance at the appointment by replying "OK" to the text.

Where people do reply, it generally works well and has cut out a huge manual workload. I'm now tidying up a couple of defects (thankfully few and low-impact), but occasionally I see responses of the form @u{some string}. I don't have rules to parse these, so they go into an invalid-responses bucket for manual follow-up.

Today I saw a response that looked as follows:

@u004f006b

I'm pretty sure at this stage that the @u denotes that what follows is Unicode (similar to the \u designator in C#), so on that assumption I get the following:

U+004F => decimal 79 => O (uppercase)

U+006B => decimal 107 => k (lowercase)
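
The arithmetic above can be checked in a few lines. This is only a minimal sketch, assuming the text after @u really is a run of four-hex-digit 16-bit code units; the class and method names (CodeUnitCheck, FromHex) are made up for illustration.

```csharp
using System;
using System.Globalization;

static class CodeUnitCheck
{
    // Hypothetical helper: parses one group of four hex digits into a char.
    public static char FromHex(string fourHexDigits)
    {
        return (char)ushort.Parse(fourHexDigits, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        // The two groups from the response "@u004f006b".
        foreach (string group in new[] { "004f", "006b" })
        {
            char c = FromHex(group);
            Console.WriteLine("U+{0} => decimal {1} => {2}", group.ToUpperInvariant(), (int)c, c);
        }
        // Prints:
        // U+004F => decimal 79 => O
        // U+006B => decimal 107 => k
    }
}
```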

The company that's responsible tells me that the message is hitting their servers like that, so it must be a client issue, right? I've looked in my SMS sending app (ChompSMS on Android 7.x) and can't see any setting that would make it explicitly send in Unicode rather than ASCII, so I'm wondering how this happens.

I pulled 10 random responses that began with this Unicode designator out of the database and had a go at writing something to deal with them. What follows is my naïve attempt at this:

using System;
using System.Text;

namespace CharConversion
{
    class Program
    {
        static void Main()
        {
            string[] unicodeResponses = new string[]
            {
                "@U00430061006e20190074002000620065002000610062006c006500200074006f002000620065002000740068006500720065",
                "@U004f006b002000bf00bf",
                "@U004f006b002000bf00bf",
                "@U004f004b002000bf00bf",
                "@U004f006b002000bf00bf",
                "@U00d2006b",
                "@U004f004b",
                "@U004f006b00610079002000bf00bf0020",
                "@U004f004b",
                "@U004f006b00bf00bf00bffffd"
            };

            foreach (string unicodeResponse in unicodeResponses)
            {
                string characters2 = UnicodeCodePointsToString(unicodeResponse);
                Console.WriteLine("'{0}' is '{1}' in plain text", unicodeResponse, characters2);
            }

            Console.Read();
        }

        private static string UnicodeCodePointsToString(string unicodeResponse)
        {
            string[] characterByteValues = SplitStringEveryN(unicodeResponse.Substring(2), 4);
            char[] characters = new char[characterByteValues.Length];

            for (int i = 0; i < characterByteValues.Length; i++)
            {
                int ordinal = Int32.Parse(characterByteValues[i], System.Globalization.NumberStyles.HexNumber);
                characters[i] = (char) ordinal;
            }

            return new string(characters);
        }

        private static string[] SplitStringEveryN(string input, int splitLength)
        {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < input.Length; i++)
            {
                if (i % splitLength == 0)
                {
                    sb.Append(' ');
                }
                sb.Append(input[i]);
            }

            string[] returnValue = sb.ToString().TrimStart().Split(' ');
            return returnValue;
        }
    }
}

My questions:

  1. Why is this happening in the first place?

  2. With the code - is there anything I'm missing here? E.g. is there something in the Framework that can already handle this for me, or is there some glaring shortcoming that People Who Know All About Unicode can see? Is there something I can do better?

  3. Some of the code points still render as upside-down question marks (I suspect myself that these are emojis) - is there any way I can handle them?

EDIT 2018-04-26 A note for posterity

(I was going to put this in a comment but it looked awful no matter what I did with it)

I had a look at the link in the accepted answer, and while the code is more concise than mine, the output at the end is identical - including the inverted question marks (and the glyphs I suspect are emojis). Some more reading on the differences between Unicode and UCS2 can be found here and the Wikipedia article is worth a read as well:

TL;DR

  • UCS-2 is obsolete and has since been replaced with UTF-16
  • UCS-2 is a fixed width encoding scheme while UTF-16 is a variable width encoding scheme
  • UTF-16 capable applications can read UCS-2 files but not the other way around
  • UTF-16 supports right to left scripts while UCS-2 does not
  • UTF-16 supports normalization while UCS-2 does not
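
Of these points, the fixed-width versus variable-width one is what matters for decoding. A small sketch of it, using U+1F600 as an arbitrary example of a code point outside the Basic Multilingual Plane (the class and method names are hypothetical): UTF-16 spends two code units, a surrogate pair, on it, whereas UCS-2 has no way to represent it at all.

```csharp
using System;

static class SurrogateDemo
{
    // Number of UTF-16 code units needed for one Unicode code point.
    public static int CodeUnitCount(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    static void Main()
    {
        Console.WriteLine(CodeUnitCount(0x004F));   // 1: 'O' fits in a single 16-bit unit
        Console.WriteLine(CodeUnitCount(0x1F600));  // 2: needs a surrogate pair

        string astral = char.ConvertFromUtf32(0x1F600);
        Console.WriteLine(char.IsHighSurrogate(astral[0])); // True
        Console.WriteLine(char.IsLowSurrogate(astral[1]));  // True

        // The pair as it would look in a hex dump of 16-bit units.
        Console.WriteLine("{0:x4}{1:x4}", (int)astral[0], (int)astral[1]); // d83dde00
    }
}
```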

Solution

  • An SMS message can be encoded in one of several encodings: 7-bit (GSM-7), 8-bit, and 16-bit (UCS-2). Most SMS programs pick the least wasteful encoding, but there is nothing invalid about using the 16-bit one even if every character would fit in one of the others. I assume that's what happens in your case. Of course, SMS messages are transferred as bytes, not as u004f006b strings, so why it is represented like that is a matter of the tools you use / third parties you work with.

    As for your parsing code: it assumes the string is in UTF-16 (the internal representation of a C# string), but if the above is correct, the encoding is UCS-2. That is very similar to UTF-16, but not exactly the same. I'm not really qualified to discuss the differences, but you can look at, for example, this answer for some clues about how to work with it. That might also be the reason why some characters are decoded incorrectly.
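
As one way of letting the Framework do the work, here is a hedged sketch that turns the hex text back into bytes and hands them to Encoding.BigEndianUnicode (UTF-16BE, which agrees with UCS-2 for every code unit that is not a surrogate). It assumes the prefix is always @u or @U and that the rest is hex for big-endian 16-bit units; SmsHexDecoder and Decode are names invented for this example, not anything the SMS provider supplies.

```csharp
using System;
using System.Text;

static class SmsHexDecoder
{
    // Hypothetical helper: turns "@U004f006b" into "Ok".
    public static string Decode(string response)
    {
        // Anything without the assumed prefix is returned untouched.
        if (response == null || !response.StartsWith("@u", StringComparison.OrdinalIgnoreCase))
            return response;

        string hex = response.Substring(2);
        if (hex.Length % 4 != 0)
            throw new FormatException("Expected groups of four hex digits.");

        // Two hex digits per byte, so four digits make one 16-bit unit.
        byte[] bytes = new byte[hex.Length / 2];
        for (int i = 0; i < bytes.Length; i++)
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);

        return Encoding.BigEndianUnicode.GetString(bytes);
    }

    static void Main()
    {
        // What "Ok" looks like as 16-bit big-endian bytes.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("Ok"))); // 00-4F-00-6B

        // One of the sample responses from the question.
        string decoded = Decode("@U004f006b00bf00bf00bffffd");
        foreach (char c in decoded)
            Console.Write("U+{0:X4} ", (int)c);
        // Prints: U+004F U+006B U+00BF U+00BF U+00BF U+FFFD
    }
}
```

Dumping the code units is useful for the third question: U+00BF is the inverted question mark itself and U+FFFD is the Unicode replacement character. For these samples the decoder is therefore reproducing what was stored, and whatever the sender typed (possibly emoji) appears to have been substituted before the hex string was produced, so it cannot be recovered from the string alone.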