python utf-8 character-encoding byte reed-solomon

'UTF-8' decoding error while using unireedsolomon package


I have been writing code that uses the unireedsolomon package. The package adds parity bytes, which are mostly extended ASCII characters. I apply bit-level errors after converting the string, including these 'special character' parities, to bits with the following code:

def str_to_byte(padded):
    byte_array = padded.encode()
    binary_int = int.from_bytes(byte_array, "big")
    binary_string = bin(binary_int)
    without_b = binary_string[2:]
    return without_b

def byte_to_str(without_b):
    binary_int = int(without_b, 2)
    byte_number = (binary_int.bit_length() + 7) // 8  # round up to whole bytes
    binary_array = binary_int.to_bytes(byte_number, "big")
    ascii_text = binary_array.decode()
    padded_char = ascii_text[:]
    return padded_char

After converting the string to a bit stream, I apply errors at random. In some cases I cannot convert the bits back to the special characters (or any characters), and I get a 'utf-8' decoding error before I can even decode the message.

If I flip a bit, the result should still be within the 256 extended ASCII character values (0-255), but somehow I am getting errors. Is there any way to fix this?


Solution

  • It's a bit odd that the error-correction package works with Unicode strings. It is better to encode byte data, since the data being encoded/decoded may not be only text. There is also no need to work with actual binary strings (Unicode 1s and 0s). Flip bits in the byte strings instead.

    Below I've wrapped the encode/decode routines so that encode takes Unicode text and returns a byte string, and decode does the reverse. There is also a corrupt function that flips bits in the encoded result to show the error correction in action:

    import unireedsolomon as rs
    import random
    
    def corrupt(encoded):
        '''Flip up to 3 bits (might pick the same bit more than once).
        '''
        b = bytearray(encoded) # convert to writable bytes
        for _ in range(3):
            index = random.randrange(len(b)) # pick random byte
            bit = random.randrange(8)        # pick random bit
            b[index] ^= 1 << bit             # flip it
        return bytes(b) # back to read-only bytes, but not necessary
    
    def encode(coder,msg):
        '''Convert the msg to UTF-8-encoded bytes and encode with "coder".  Return as bytes.
        '''
        return coder.encode(msg.encode('utf8')).encode('latin1')
    
    def decode(coder,encoded):
        '''Decode the encoded message with "coder", convert result to bytes and decode UTF-8.
        '''
        return coder.decode(encoded)[0].encode('latin1').decode('utf8')
    
    coder = rs.RSCoder(20,13)
    msg = 'hello(你好)'  # 9 Unicode characters, but 13 (maximum) bytes when encoded to UTF-8.
    encoded = encode(coder,msg)
    print(encoded)
    corrupted = corrupt(encoded)
    print(corrupted)
    decoded = decode(coder,corrupted)
    print(decoded)
    

    Output. Note that the first l in hello (ASCII 0x6C) was corrupted to 0xEC, the second l changed to an h (ASCII 0x68), and another byte changed from 0xE5 to 0xF5. Because RSCoder(20,13) adds 7 error-correcting bytes, you can actually change any 3 bytes at random (not just bits), including error-correcting bytes, and the message will still decode.

    b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
    b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
    hello(你好)
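
The corruption described above can be checked directly from the two printed byte strings. The following is a minimal sketch that uses only the standard library; the variable names are illustrative, and the bound of (n - k) // 2 correctable bytes is the usual Reed-Solomon limit for errors at unknown positions, assumed here to apply to RSCoder(20,13).

```python
# Byte strings copied from the output above.
encoded = b'hello(\xe4\xbd\xa0\xe5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'
corrupted = b'he\xecho(\xe4\xbd\xa0\xf5\xa5\xbd)8\xe6\xd3+\xd4\x19\xb8'

n, k = 20, 13                    # the parameters passed to RSCoder(20, 13)
max_correctable = (n - k) // 2   # assumed Reed-Solomon bound on corrupted bytes

# Collect (index, original byte, corrupted byte, XOR mask) for each difference.
diffs = [(i, a, b, a ^ b)
         for i, (a, b) in enumerate(zip(encoded, corrupted))
         if a != b]

for i, a, b, mask in diffs:
    print(f'byte {i}: 0x{a:02X} -> 0x{b:02X} (flipped bits 0x{mask:02X})')
print(len(diffs), 'bytes differ; up to', max_correctable, 'should be correctable')
```

Each XOR mask has a single bit set, which matches what corrupt does: one bit flipped per chosen byte.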
    

    A note about .encode('latin1'): The encoder works with Unicode strings whose characters are limited to the code points U+0000 to U+00FF. Because Latin-1 is the first 256 Unicode code points, the 'latin1' codec converts a Unicode string made up of those code points 1:1 to their byte values, resulting in a byte string with values ranging from 0 to 255.
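
If a string of 1s and 0s is still wanted, the same Latin-1 mapping suggests one way to repair the helpers from the question. This is only a sketch: str_to_bits, bits_to_str, flip_bit and the sample codeword are hypothetical names and data, not part of unireedsolomon. It pads every character to exactly 8 bits so leading zeros survive, then contrasts the result with a bit flip in UTF-8 bytes (the default for str.encode()), which can leave an invalid sequence.

```python
def str_to_bits(text):
    """Map each code point U+0000..U+00FF to one byte, then to 8 bits."""
    data = text.encode('latin1')
    return ''.join(f'{byte:08b}' for byte in data)

def bits_to_str(bits):
    """Inverse of str_to_bits: every group of 8 bits becomes one character."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode('latin1')

def flip_bit(bits, position):
    """Return a copy of the bit string with one bit inverted."""
    flipped = '1' if bits[position] == '0' else '0'
    return bits[:position] + flipped + bits[position + 1:]

# Every byte value 0..255 survives a Latin-1 round trip.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin1').encode('latin1') == all_bytes

codeword = 'hi\xe9\x00'        # hypothetical coder output with a non-ASCII character
bits = str_to_bits(codeword)
damaged = bits_to_str(flip_bit(bits, 16))   # flip the top bit of the third character
print(repr(damaged))           # still a valid string, just a different character

# A bit flip in UTF-8 bytes can break the multi-byte structure instead.
utf8 = bytearray('\xe9'.encode('utf8'))     # two bytes: 0xC3 0xA9
utf8[1] ^= 0x40                             # 0xA9 -> 0xE9, not a continuation byte
try:
    utf8.decode('utf8')
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print('UTF-8 decode failed:', exc.reason)
```

With Latin-1, any flipped bit yields another code point in U+0000 to U+00FF, so the corrupted string can always be handed to the decoder for correction.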