python file byte text-files corruption

Why could a single-byte corruption happen while manipulating plain text files?


I am really confused here, and I cannot even state the topic of the question more clearly. While manipulating plain text files, I encountered a strange replacement of symbols (bytes).

For example, I had a file with about 20000 strings, one of which is: MIEPTLIRVGEAFYDITHLAPTRHTVPVLVRGNFAKVPVRISYTNHCYSRTPRAGEQVPTGHEIKDGAKLRMFCEQRHRLSSYLPQILIDLLQGETSLWQAAGGNFLQVELVDDVDGEPPTKIEYNVILRMERLKPEGDQKHIMIRVETAYPEDIEYDKPFRKKSYKVSRILAAKWEDRDHREPEPKPGKGKGKAKKK

I merged about 1000 such files by simply writing them one after another with Python (using the plain open(filename) method). In the resulting file, the corresponding string looked like this (all other strings were fine): MIEPTLIRVGEAFYDITHLAPTRHTVPVLVRGNFAKVPVRISYTNHCYSRTPRAGEQVPTGHEIKDGAKLRMFCEQRHRLSSYLPQILIDLLQGETSLWQAAGGNFLQVELVDDVDGEPPTKIEYNVILRMERLKPEGDQKHIMIRVETAYPEDIEYDKPFRKKSЩKVSRILAAKWEDRDHREPEPKPGKGKGKAKKK

Thus, "Y" (hex 59) was replaced by the letter "Щ" (hex D9) in the fragment KPFRKKSYKVSRIL near the end of the string. If I repeat the procedure, no replacement occurs at this place, so it seems to be random (?). I also noticed the same kind of replacement in another case, between "P" (hex 50) and the Russian letter "Р" (hex D0). What unites these cases is that the letters in each pair occupy the same position if we count from positions 0 and 128 of the extended ASCII table: English P is at position 80 and Russian Р at 128+80=208; Y is at position 89 and "Щ" at 128+89=217. I guess this is some kind of file corruption, but how and why does it happen? Any ideas?
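The position arithmetic above can be checked with a minimal sketch. It assumes the bytes were displayed using the Windows-1251 code page (`cp1251`), which the question does not name; the helper `byte_value` and the constant `ASSUMED_CODEPAGE` are hypothetical names used only for illustration.

```python
# Assumption: the file was viewed as Windows-1251 (cp1251); the question
# only gives the hex values, not the code page.
ASSUMED_CODEPAGE = "cp1251"


def byte_value(ch, encoding=ASSUMED_CODEPAGE):
    """Return the single-byte value of one character in the given encoding."""
    encoded = ch.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single byte in {encoding}")
    return encoded[0]


# (Latin letter, Cyrillic letter) pairs from the question.
# "\u0429" is Cyrillic SHCHA, "\u0420" is Cyrillic ER.
PAIRS = [("Y", "\u0429"), ("P", "\u0420")]

for latin, cyrillic in PAIRS:
    low, high = byte_value(latin), byte_value(cyrillic)
    print(f"{latin}: {low} (0x{low:02X})  "
          f"{cyrillic}: {high} (0x{high:02X})  "
          f"difference: {high - low}")
```

Under that assumption, both pairs differ by exactly 128, which matches the positions quoted above.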


Solution

  • I should have guessed it myself before even asking: it looks like a single bit flip, which could plausibly occur at random as an error while reading from or writing to disk. If the first (most significant) bit of a byte encoding a letter flips, the replacement becomes visible because the byte no longer falls within the first 128 symbols of the ASCII table, and some software becomes cranky about it.

    "Y" = 01011001
    "Щ" = 11011001 
    
    "P" = 01010000 
    "Р" = 11010000
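The bit patterns above can be reproduced with a short sketch, together with a simple scan for such bytes. It assumes the files are expected to contain pure 7-bit ASCII, so any byte with the high bit set is suspicious; `flip_high_bit` and `find_non_ascii` are hypothetical helper names, not part of any library.

```python
HIGH_BIT = 0x80  # mask for the most significant bit of a byte


def flip_high_bit(byte):
    """Return the byte value with its most significant bit inverted."""
    return byte ^ HIGH_BIT


def find_non_ascii(data):
    """Return (offset, value) for every byte outside the 7-bit ASCII range."""
    return [(offset, value) for offset, value in enumerate(data)
            if value & HIGH_BIT]


# Simulate the corruption on a fragment of the sequence from the question:
# flip the top bit of the "Y" at index 3.
sample = bytearray(b"KKSYKVSRIL")
sample[3] = flip_high_bit(sample[3])

for offset, value in find_non_ascii(bytes(sample)):
    print(f"offset {offset}: 0x{value:02X} = {value:08b}, "
          f"would be {chr(flip_high_bit(value))!r} with the high bit cleared")
```

To check a real merged file, the same scan could be run on the result of `open(path, "rb").read()`. Reading in binary mode avoids any decoding step that might hide or alter the affected byte.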