linux c++17 char16-t

How to read from the console to char16_t buffer


I work on Linux. I have to read from the console to char16_t buffer. Currently my code looks like this:

char tempBuf[1024] = {0};
int readBytes = read(STDIN_FILENO, tempBuf, 1024);
char16_t* buf = convertToChar16(tempBuf, readBytes); 

Inside the convert function I use the mbrtoc16 standard library function to convert each character separately. Is this the only way to read from the console into a char16_t buffer? Do you know of any alternative solution?


Solution

  • Multi-byte Characters

    The main thing to be careful of when reading into a fixed-length buffer is accidentally truncating a "multi-byte character" in your "multi-byte string".

    What is a multi-byte character, you ask? In my environment they're UTF-8 characters: for example, running echo $LANG prints en_US.UTF-8. They are exactly what they sound like: characters that can be stored across multiple bytes. In UTF-8, anything outside the 7-bit ASCII set is stored in two to four consecutive bytes. If you read only part of a multi-byte character (truncating it), you end up with garbage on both sides of the read.
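    To make the byte layout a bit more tangible, here is a minimal sketch of how you could tell from the bytes alone whether a buffer ends in the middle of a character. It assumes the input really is UTF-8; the helper names utf8SequenceLength and incompleteTailLength are made up for this illustration, and the sketch does not reject overlong or otherwise malformed lead bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence (a continuation byte
// 10xxxxxx, or 0xF8 and above).
int utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;             // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;
}

// Hypothetical helper: how many bytes at the end of buf[0..len) belong to a
// multi-byte character whose remaining bytes have not been read yet.
std::size_t incompleteTailLength(const char *buf, std::size_t len)
{
    assert(buf != nullptr || len == 0);
    std::size_t back = 0;
    // A lead byte is followed by at most 3 continuation bytes, so looking
    // at the last 4 bytes is enough.
    while (back < len && back < 4) {
        unsigned char b = static_cast<unsigned char>(buf[len - 1 - back]);
        ++back;
        if ((b & 0xC0) != 0x80) {          // not a continuation byte
            int need = utf8SequenceLength(b);
            return (need > static_cast<int>(back)) ? back : 0;
        }
    }
    return 0; // only continuation bytes seen: malformed, nothing to hold back
}
```

    For the five bytes 'a' 'b' 'c' 'd' 0xC3, incompleteTailLength reports 1: the last byte is the first half of a two-byte character and should be kept for the next read.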

    So let's see a concrete example:

    Example Code

    In the complete runnable file below, I purposefully shorten the buffer to just 5 bytes, which is enough to hold a full 4-byte UTF-8 character plus a null terminator.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    
    #define BUF_LEN 5
    
    /* print each byte of buf as a hex value and as a raw character */
    static void printBytes(const char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
        {
            /* cast to unsigned char first: bytes above 0x7f are negative when
             * char is signed and would be sign-extended when promoted to int,
             * which garbles the hex output */
            printf("\t%zu) 0x%02x -- %c\n",
                   i,
                   (unsigned char)buf[i],
                   (unsigned char)buf[i]);
        }
    }
    
    int main()
    {
        /* you do your read assuming some byte length */
        char tempBuf[BUF_LEN] = {0};
        ssize_t readBytes = read(STDIN_FILENO, tempBuf, BUF_LEN);
        if (readBytes <= 0)
        {
            return 1; /* read error or end of input */
        }
    
        /* read() does not add a null terminator, so printing tempBuf with %s
         * could overrun the buffer; look at it byte by byte instead */
        printf("Printing bytes:\n");
        printBytes(tempBuf, (size_t)readBytes);
    
        /* so what do we do if the read ends in non-ASCII bytes? we put them
         * back into stdin: start at the end and walk backward to the most
         * recent ASCII character */
        printf("\nlet's back up\n");
        ssize_t last = readBytes - 1;
        while (last >= 0 && (unsigned char)tempBuf[last] > 127)
        {
            /* the C standard guarantees only one byte of pushback */
            ungetc((unsigned char)tempBuf[last--], stdin);
        }
        printf("try again on that character\n");
        memset(tempBuf, 0, BUF_LEN); /* zero the buffer so what we read makes sense */
        if (fgets(tempBuf, BUF_LEN, stdin) == NULL)
        {
            return 1;
        }
        printf("Printing bytes again:\n");
        printBytes(tempBuf, BUF_LEN);
        printf("Multi-byte string all at once: \"%s\"", tempBuf);
    
        return 0;
    }
    

    Running an example

    Using the code above, I can construct an input that I know will truncate a character, to see what is going on:

    scott@scott-G3:~/tmp$ g++ -o stackoverflow_example stackoverflow_example.cpp 
    scott@scott-G3:~/tmp$ ./stackoverflow_example 
    abcdé
    Printing bytes:
        0) 0x61 -- a
        1) 0x62 -- b
        2) 0x63 -- c
        3) 0x64 -- d
        4) 0xc3 -- �
    
    let's back up
    try again on that character
    Printing bytes again:
        0) 0xc3 -- �
        1) 0xa9 -- �
        2) 0x0a -- 
    
        3) 0x00 -- 
        4) 0x00 -- 
    Multi-byte string all at once: "é
    

    So what happened?

    In the example above, I purposefully positioned the UTF-8 character "é", which is encoded as the two bytes 0xC3, 0xA9, so that it would get cut off by the read call. I then used ungetc to put 0xC3 back into stdin and read it again together with its partner 0xA9. Only when they're next to each other do they make any sense. The 0x0a that follows is the familiar '\n', because the second read captured my Return key press as well.
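    Since the question already uses mbrtoc16, it may be worth noting that the function keeps partial characters in its mbstate_t argument, so a truncated character can also be carried across reads without pushing anything back. The following is only a sketch under assumptions: the locale must be a multi-byte one such as en_US.UTF-8 (set via setlocale), and the names useEnvironmentLocale and appendAsChar16 are invented for this example. It relies on the return values of mbrtoc16 described in the C and C++ standard library documentation: (size_t)-1 for an invalid sequence, (size_t)-2 when the remaining bytes form an incomplete character, and (size_t)-3 when the second half of a surrogate pair is delivered.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Hypothetical helper: adopt the user's locale (for example en_US.UTF-8) so
// that mbrtoc16 knows how the input bytes are encoded.
bool useEnvironmentLocale()
{
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Hypothetical helper: convert one chunk of multi-byte input and append the
// result to `out`. `state` must be the same object for every chunk of one
// input stream; it remembers a character that was cut off at the end of the
// previous chunk. Returns false on an invalid byte sequence.
bool appendAsChar16(const char *bytes, std::size_t n,
                    std::mbstate_t &state, std::u16string &out)
{
    assert(bytes != nullptr || n == 0);
    for (;;) {
        char16_t c = 0;
        std::size_t r = std::mbrtoc16(&c, bytes, n, &state);
        if (r == static_cast<std::size_t>(-3)) {
            out.push_back(c);   // low surrogate, no input consumed
            continue;
        }
        if (r == static_cast<std::size_t>(-2)) {
            return true;        // chunk used up (possibly mid-character)
        }
        if (r == static_cast<std::size_t>(-1)) {
            return false;       // invalid sequence for this locale
        }
        if (r == 0) {
            r = 1;              // a null character was converted
        }
        out.push_back(c);
        bytes += r;
        n -= r;
    }
}
```

    The idea would be to call useEnvironmentLocale() once at startup, keep a single zero-initialized std::mbstate_t next to the read loop, and pass every chunk returned by read() to appendAsChar16. With the input from the run above, the first chunk would contribute "abcd" and the 0xC3 would stay in the state until the next chunk supplies 0xA9.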