c linux shell terminal signals

Why does a Unicode character print correctly even though I handle it one byte at a time?


I am doing a school project and I came across something that shouldn't work in theory.

I need to create two programs, which I will call client and server, where one communicates with the other through Unix signals. I pass a message in my client's argv, break each char into bits, and send them to the server.

The idea is to use bitwise communication (something simple and rudimentary): if the bit is 0, I send SIGUSR1 to the server PID using the kill system call; if it is 1, I send SIGUSR2.

#client send a char to server
int send_sig(int pid, unsigned char b)
{
    int a;

    a = 0;
    while (a < 8)
    {
        if (b & 1)
            kill(pid, SIGUSR2);
        else
            kill(pid, SIGUSR1);
        b = b >> 1;
        a++;
        usleep(1000);
    }
    return (0);
}

The problem is when I use Unicode characters. argv is always a string (an array of char), so a Unicode character I pass takes from 1 to 4 bytes. Even so, the client side proceeds normally; the problem is on my server side, where I receive these bits.

The way I structured my code, I print one byte at a time (which is acceptable since, in theory, a char in C is equivalent to one byte). But even when I pass 4-byte Unicode characters and print them one byte at a time, it keeps working (it's like Russian roulette: sometimes it breaks and sometimes it works normally).

# Server receiving the 
unsigned char   reverse(unsigned char b)
{
    b = (b & 0xF0) >> 4 | (b & 0x0F) << 4;
    b = (b & 0xCC) >> 2 | (b & 0x33) << 2;
    b = (b & 0xAA) >> 1 | (b & 0x55) << 1;
    return (b);
}

void    signal_handler(int sig, siginfo_t *p_info, void *ucontext)
{
    static unsigned int     a = 0;
    static unsigned int     b = 0;

    a <<= 1;
    if (sig == SIGUSR2)
        a++;
    b++;
    if (b == 8)
    {
        b = 0;
        ft_printf("%c\0", reverse(a));
    }
    p_info = p_info;
    ucontext = ucontext;
}

Why does this behavior happen? Shouldn't it just break and print something wrong?

Speculations:

  • The way I print to stdout without a NULL byte makes the shell and terminal interpret the bytes as a whole, without losing the UTF-8 mapping

  • The Unicode character fits in a char (but this is impossible, I guess)

Reproduce this behavior with this code:

#client.c file
#include <signal.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
void send_sig(int pid, char b)
{
    int a = 0;
    printf("%c", b);
    while (a < 8)
    {
        if (b & 1)
            kill(pid, SIGUSR2);
        else
            kill(pid, SIGUSR1);
        b >>= 1;
        a++;
        usleep(500);
    }
}
int main(int argc, char *argv[])
{
    char *s = "🤨🤨🤨🤨🤨🤨🤨";

    while (*s++ != '\0')
        send_sig(atoi(argv[1]), *s);

}
#server.c file
#include <unistd.h>
#include <stdio.h>
#include <signal.h>

unsigned char   reverse(unsigned char b)
{
    b = (b & 0xF0) >> 4 | (b & 0x0F) << 4;
    b = (b & 0xCC) >> 2 | (b & 0x33) << 2;
    b = (b & 0xAA) >> 1 | (b & 0x55) << 1;
    return (b);
}

void    signal_handler(int sig, siginfo_t *p_info, void *ucontext)
{
    static unsigned int     a = 0;
    static unsigned int     b = 0;

    a <<= 1;
    if (sig == SIGUSR2)
        a++;
    b++;
    if (b == 8)
    {
        b = 0;
        a = reverse(a);
        write(1, &a, 1);
    }
    p_info = p_info;
    ucontext = ucontext;
}

int main(void)
{
    struct sigaction    act;

    act.sa_sigaction = signal_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGUSR1, &act, NULL);
    sigaction(SIGUSR2, &act, NULL);
    printf("The server pid: %d\n", getpid());
    while (1)
        usleep(300);
}


Solution

  • Sending Unicode bit by bit can be implemented either by sending fixed-width code units, 16 bits (UTF-16) or 32 bits (UTF-32), or byte by byte (UTF-8). Note that only UTF-32 gives every character the same length; UTF-16 needs two 16-bit units for characters outside the Basic Multilingual Plane, such as emoji. If you send byte by byte, the first byte determines the number of bytes (bits) in the transmission. Currently, your server collects only 8 bits at a time and writes each received byte to output immediately; it never considers that the following bytes may belong to the same multibyte character.

    Once your server has the first byte (8 bits), it can calculate the number of bytes in the transmission as follows:
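
On the question of why the character still displays: a terminal decodes the bytes it receives as one continuous stream, so a UTF-8 sequence that arrives in four separate one-byte writes looks the same to it as one that arrives in a single write. The following is a minimal sketch of that idea, not the asker's code; the function name write_one_byte_at_a_time is made up for illustration.

```c
#include <stddef.h>
#include <unistd.h>

/* Write `len` bytes to `fd`, using one write() call per byte.
 * Returns 0 on success, -1 if any single write fails.
 * (Hypothetical helper, named for illustration only.) */
int write_one_byte_at_a_time(int fd, const unsigned char *bytes, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (write(fd, &bytes[i], 1) != 1)
            return -1;
    }
    return 0;
}
```

Calling it on file descriptor 1 with the four bytes 0xF0 0x9F 0xA4 0xA8 (the UTF-8 encoding of U+1F928, the emoji used in the question) should display the emoji on a UTF-8 terminal, as long as all four bytes arrive intact and in order. If a signal is lost along the way, the byte stream is corrupted, which would explain the occasional breakage.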

    if (byte < 0x80)
        num_bytes = 1; //single byte, no further read required
    else if ((byte & 0xe0) == 0xc0)
        num_bytes = 2; //one more byte to read
    else if ((byte & 0xf0) == 0xe0)
        num_bytes = 3; //two more bytes to read
    else if ((byte & 0xf8) == 0xf0)
        num_bytes = 4; //three more bytes to read
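
The same check can be wrapped in a small function so it can be tested on its own. This is only a sketch: the name utf8_sequence_length and the convention of returning 0 for a byte that cannot start a sequence are assumptions, not part of the code above.

```c
/* Number of bytes in a UTF-8 sequence, judged from its first byte.
 * Returns 0 when the byte cannot start a sequence, e.g. a continuation
 * byte of the form 10xxxxxx (an assumed convention for this sketch). */
int utf8_sequence_length(unsigned char first)
{
    if (first < 0x80)
        return 1; /* 0xxxxxxx: single byte (ASCII) */
    if ((first & 0xe0) == 0xc0)
        return 2; /* 110xxxxx: one more byte follows */
    if ((first & 0xf0) == 0xe0)
        return 3; /* 1110xxxx: two more bytes follow */
    if ((first & 0xf8) == 0xf0)
        return 4; /* 11110xxx: three more bytes follow */
    return 0;     /* not a valid lead byte */
}
```

For the emoji in the question, whose first byte is 0xF0, this returns 4.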
    

    Then, to form a valid UTF-8 (multibyte) character, read the following bytes (if any) into a char array, e.g. unsigned char utf8_bytes[4];

    Of course, in order to form a valid null-terminated (printable) string, the size of the array has to be 5 and the byte right after the character's last byte has to be set to '\0'.


    Addition

    Your client sends the bits of each byte least significant bit first. For the byte 10101010, the sequence is:

    1010101|0 -> SIGUSR1
     101010|1 -> SIGUSR2
      10101|0 -> SIGUSR1
       1010|1 -> SIGUSR2
        101|0 -> SIGUSR1
         10|1 -> SIGUSR2
          1|0 -> SIGUSR1
           |1 -> SIGUSR2
    

    So, every time your server is receiving a SIGUSR2 it has to set the bit at a certain position, which can be easily done like this:

    if (sig == SIGUSR2)
        byte |= (1 << bit_counter);
    
    ++bit_counter;
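
To check that this bit order matches the client's, the send and receive sides can be modeled without any signals. The sketch below uses two hypothetical helpers (byte_to_bits_lsb_first and bits_to_byte_lsb_first are names invented here), with a 1 standing for SIGUSR2 and a 0 for SIGUSR1.

```c
/* Split a byte into 8 bits in the order the client sends them:
 * least significant bit first. */
void byte_to_bits_lsb_first(unsigned char byte, int bits[8])
{
    for (int i = 0; i < 8; i++) {
        bits[i] = byte & 1; /* lowest remaining bit */
        byte >>= 1;
    }
}

/* Rebuild the byte on the receiving side: the i-th bit received
 * is placed at bit position i. */
unsigned char bits_to_byte_lsb_first(const int bits[8])
{
    unsigned char byte = 0;

    for (int i = 0; i < 8; i++) {
        if (bits[i])
            byte |= (unsigned char)(1u << i);
    }
    return byte;
}
```

Because the receiver places each bit at its own position, no separate bit-reversal step is needed once the byte is complete.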
    

    The complete server code could look like this:

    void signal_handler(int sig, siginfo_t *p_info, void *ucontext)
    {
        static unsigned char utf8_bytes[5]; //multibyte storage
        static unsigned char byte = 0; //bitset
        
        static int byte_index  = 0; //current position in the mb storage
        static int bit_counter = 0; //number of bits received
        static int num_bytes   = 1; //remaining bytes of the mb character
        
        if (sig == SIGUSR2) //bit: 1
            byte |= (1 << bit_counter); //set the according bit in byte
            
        if (++bit_counter == 8) { //we received 8 bits -> 1 byte
        
            if (byte_index == 0) { //if first byte in sequence
                if (byte < 0x80)
                    num_bytes = 1; //single byte, no further read required
                else if ((byte & 0xe0) == 0xc0)
                    num_bytes = 2; //one more byte to read
                else if ((byte & 0xf0) == 0xe0)
                    num_bytes = 3; //two more bytes to read
                else if ((byte & 0xf8) == 0xf0)
                    num_bytes = 4; //three more bytes to read
                else
                    num_bytes = 1; //stray byte: pass it through on its own
            }
    
            utf8_bytes[byte_index++] = byte; //store every byte, including the last one
    
            //since we completed 1 byte, decrease num_bytes
            if (--num_bytes == 0) { //and if there are no more bytes to read
                utf8_bytes[byte_index] = '\0'; //make null-terminated string
                write(1, utf8_bytes, byte_index); //do something useful (write is async-signal-safe, printf is not)
                byte_index = 0; //reset the byte index
            }
            
            bit_counter = 0; //reset counter
            byte        = 0; //reset byte (set all bits to zero)
    
        }
        
        (void)p_info; //unused
        (void)ucontext; //unused
    }
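
Since a signal handler is awkward to test, the same assembling logic can also be written as a plain function that is fed one bit at a time. The sketch below is one possible way to do that; the struct utf8_assembler and the function utf8_assembler_push_bit are hypothetical names, and the code does not validate continuation bytes.

```c
/* Hypothetical state for assembling one UTF-8 character from single bits.
 * Zero-initialize it before first use: struct utf8_assembler st = {0}; */
struct utf8_assembler {
    unsigned char bytes[5]; /* up to 4 bytes plus the terminating '\0' */
    unsigned char current;  /* byte being built, least significant bit first */
    int bit_count;          /* bits received for the current byte */
    int byte_index;         /* next free position in bytes */
    int remaining;          /* bytes still expected for this character */
};

/* Feed one bit (0 or 1). Returns the character's length in bytes when a
 * character has just been completed (st->bytes then holds it as a
 * null-terminated string), otherwise returns 0. */
int utf8_assembler_push_bit(struct utf8_assembler *st, int bit)
{
    if (bit)
        st->current |= (unsigned char)(1u << st->bit_count);
    if (++st->bit_count < 8)
        return 0; /* byte not complete yet */

    unsigned char byte = st->current;
    st->current = 0;
    st->bit_count = 0;

    if (st->byte_index == 0) { /* first byte decides the length */
        if ((byte & 0xe0) == 0xc0)
            st->remaining = 2;
        else if ((byte & 0xf0) == 0xe0)
            st->remaining = 3;
        else if ((byte & 0xf8) == 0xf0)
            st->remaining = 4;
        else
            st->remaining = 1; /* ASCII, or a stray byte passed through */
    }

    st->bytes[st->byte_index++] = byte; /* store every byte, including the last */
    if (--st->remaining > 0)
        return 0; /* more bytes expected */

    st->bytes[st->byte_index] = '\0';
    int len = st->byte_index;
    st->byte_index = 0;
    return len;
}
```

A signal handler could then call utf8_assembler_push_bit(&st, sig == SIGUSR2) on a static instance and pass st.bytes to write() whenever the return value is nonzero.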