Tags: c#, utf-8, winsock, utf8-decode

UTF8 Byte to String & Winsock GetStream


I'm trying to convert a large block of data from bytes to a string (11076 bytes).

The problem is that the resulting string is missing characters (length 10996).


The data is received over a Winsock connection. Here is the process:

    public static void UpdateClient(UserConnection client)
    {
        string data = null;
        Decoder utf8Decoder = Encoding.UTF8.GetDecoder();

            Console.WriteLine("Iniciando");
            byte[] buffer = ReadFully(client.TCPClient.GetStream(), 0);
            int charCount = utf8Decoder.GetCharCount(buffer, 0, buffer.Length);
            Char[] chars = new Char[charCount];
            int charsDecodedCount = utf8Decoder.GetChars(buffer, 0, buffer.Length, chars, 0);

            foreach (Char c in chars)
            {
                data = data + String.Format("{0}", c);
            }

            int buffersize = buffer.Length;
            Console.WriteLine("Chars is: " + chars.Length);
            Console.WriteLine("Data is: " + data);
            Console.WriteLine("Byte is: " + buffer.Length);
            Console.WriteLine("Size is: " + data.Length);
            Server.Network.ReceiveData.SelectPacket(client.Index, data);
    }

    public static byte[] ReadFully(Stream stream, int initialLength)
    {
        if (initialLength < 1)
        {
            initialLength = 32768;
        }

        byte[] buffer = new byte[initialLength];
        int read = 0;

        int chunk;

        chunk = stream.Read(buffer, read, buffer.Length - read);

        checkreach:
            read += chunk;

            if (read == buffer.Length)
            {
                int nextByte = stream.ReadByte();

                if (nextByte == -1)
                {
                    return buffer;
                }

                byte[] newBuffer = new byte[buffer.Length * 2];
                Array.Copy(buffer, newBuffer, buffer.Length);
                newBuffer[read] = (byte)nextByte;
                buffer = newBuffer;
                read++;
                goto checkreach;
            }

        byte[] ret = new byte[read];
        Array.Copy(buffer, ret, read);
        return ret;
    }

Does anyone have tips or a solution?


Solution

  • It's perfectly normal for UTF-8 encoded text to take more bytes than it has characters. In UTF-8, some characters (for example á and ã) are encoded as two or more bytes, so 11076 bytes decoding to 10996 characters does not by itself mean that anything was lost.
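
    To see the difference between byte count and character count in isolation, here is a minimal sketch. The sample text and the class name Utf8LengthDemo are made up for illustration and are not the asker's data:

    ```csharp
    using System;
    using System.Text;

    class Utf8LengthDemo
    {
        static void Main()
        {
            // Hypothetical sample: 'a', 'ç' (U+00E7), 'ã' (U+00E3), 'o'.
            // Escapes are used so the literal does not depend on file encoding.
            string text = "a\u00E7\u00E3o";

            byte[] bytes = Encoding.UTF8.GetBytes(text);
            string roundTrip = Encoding.UTF8.GetString(bytes);

            Console.WriteLine("Chars: " + text.Length);   // Chars: 4
            Console.WriteLine("Bytes: " + bytes.Length);  // Bytes: 6
            Console.WriteLine("Round trip OK: " + (roundTrip == text)); // Round trip OK: True
        }
    }
    ```

    The two accented characters take two bytes each, so the byte array is longer than the string even though no character is missing.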

    That said, you shouldn't use the ReadFully method: it returns garbage if the data doesn't fit in the initial buffer, and it returns only part of the data if a single Read call doesn't deliver the entire stream. The way the char array is converted to a string, one concatenation per character, is also extremely slow. Just use a StreamReader to read the stream and decode it to a string:

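    // Requires: using System.IO; using System.Text;
    public static void UpdateClient(UserConnection client) {
      string data;
      // StreamReader decodes UTF-8 incrementally, so a multi-byte character
      // split across two reads is still decoded correctly.
      // Note: disposing the reader also closes the underlying stream.
      using (StreamReader reader = new StreamReader(client.TCPClient.GetStream(), Encoding.UTF8)) {
        // Reads until the end of the stream is reached.
        data = reader.ReadToEnd();
      }
      Console.WriteLine("Data is: " + data);
      Console.WriteLine("Size is: " + data.Length);
      Server.Network.ReceiveData.SelectPacket(client.Index, data);
    }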
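
    If you also need the raw bytes (for example to print the byte count, as the question does), a read loop along these lines avoids both problems in ReadFully. This is only a sketch: the names StreamText and ReadAllBytes are hypothetical, and a MemoryStream stands in for the network stream so the example runs on its own:

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    static class StreamText
    {
        // Hypothetical helper: keeps calling Read until it returns 0
        // (end of stream), no matter how the data is split into chunks.
        public static byte[] ReadAllBytes(Stream stream)
        {
            byte[] chunk = new byte[4096];
            using (MemoryStream collected = new MemoryStream())
            {
                int read;
                while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
                {
                    // Append only the bytes that this Read call delivered.
                    collected.Write(chunk, 0, read);
                }
                return collected.ToArray();
            }
        }

        static void Main()
        {
            // 50000 copies of 'ã' (U+00E3), two bytes each in UTF-8,
            // so the payload is larger than ReadFully's 32768-byte buffer.
            string original = new string('\u00E3', 50000);
            byte[] encoded = Encoding.UTF8.GetBytes(original);

            using (MemoryStream source = new MemoryStream(encoded))
            {
                byte[] received = ReadAllBytes(source);
                // Decode only after all bytes have arrived, so no
                // multi-byte character is cut in half.
                string decoded = Encoding.UTF8.GetString(received);

                Console.WriteLine("Byte is: " + received.Length);        // Byte is: 100000
                Console.WriteLine("Size is: " + decoded.Length);         // Size is: 50000
                Console.WriteLine("Match: " + (decoded == original));    // Match: True
            }
        }
    }
    ```

    On a real network stream, Read only returns 0 once the remote side has closed the connection, so both this loop and ReadToEnd wait until then.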