java, encryption, encoding

How to detect encoding mismatch


I have a bunch of old AES-encrypted Strings encrypted roughly like this:

  1. String is converted to bytes with ISO-8859-1 encoding
  2. Bytes are encrypted with AES
  3. Result is converted to BASE64 encoded char array

Now I would like to change the encoding to UTF-8 for new values (e.g. '€' cannot be represented in ISO-8859-1). This will of course cause problems if I try to decode the old ISO-8859-1 encoded values as UTF-8 after decrypting them:

org.junit.ComparisonFailure: expected:<!#[¤%&/()=?^*ÄÖÖÅ_:;>½§@${[]}<|'äöå-.,+´¨]'-Lorem ipsum dolor ...> but was:<!#[�%&/()=?^*����_:;>��@${[]}<|'���-.,+��]'-Lorem ipsum dolor ...>

I'm thinking of creating some automatic encoding fallback for this.

So the main question is: is it enough to inspect the decrypted char array for '�' characters to detect an encoding mismatch? And what is the 'correct' way to write that '�' symbol in the comparison?

if (new String(utf8decryptedCharArray).contains("�")) {
    // Revert to doing the decrypting with ISO-8859-1
    decryptAsISO...
}

Solution

  • When decrypting, you get back the original byte sequence (result of your step 1), and then you can only guess whether these bytes denote characters according to the ISO-8859-1 or the UTF-8 encoding.

    From a byte sequence, there's no way to clearly tell how it is to be interpreted.
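To illustrate the ambiguity, here is a minimal sketch (the class and method names are made up for this example). The two bytes C3 A4 are a perfectly valid UTF-8 encoding of 'ä', and at the same time a perfectly valid ISO-8859-1 encoding of the two characters 'Ã' and '¤', so nothing in the bytes themselves says which reading was intended.

```java
import java.nio.charset.StandardCharsets;

final class AmbiguousBytesDemo {
    // Interpret the bytes as ISO-8859-1: every byte maps to exactly one character.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Interpret the same bytes as UTF-8.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA4};
        System.out.println(asUtf8(bytes).length());   // one character, U+00E4
        System.out.println(asLatin1(bytes).length()); // two characters, U+00C3 and U+00A4
    }
}
```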

    A few ideas:

    • You could migrate all the old encrypted strings (decrypt, decode to string using ISO-8859-1, encode to byte array using UTF-8, encrypt). Then the problem is solved once and forever.
    • You could try to decode the byte array both ways, see whether one decoding is illegal or whether both give the same result, and if it is still ambiguous, take the one that is more probable given the characters you expect. I wouldn't recommend going that way, as it needs a lot of work and still leaves some probability of error.
    • For new entries, you could prefix the string / byte sequence with a marker that doesn't appear in ISO-8859-1 text. E.g. some people follow the convention of prepending a byte order mark (BOM) to UTF-8 encoded files. Although the resulting bytes (EF BB BF) aren't strictly illegal in ISO-8859-1 (they read as ï»¿), they are highly unlikely. Then, when your decrypted bytes start with EF BB BF, decode to string using UTF-8, otherwise using ISO-8859-1. Still, there's a non-zero probability of error.
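The marker idea, combined with a strict UTF-8 check, could look roughly like the sketch below. All names (`EncodingFallback`, `encodeNew`, `strictUtf8`, `decode`) are hypothetical, and the encryption step is left out: the methods only deal with the plaintext bytes before encrypting and after decrypting. Two notes on the original check for '�': that character is U+FFFD and can be written as `'\uFFFD'` in Java source, but a decoder configured with `CodingErrorAction.REPORT` fails on malformed input instead of silently substituting it, which avoids scanning the result. Also, Java's ISO-8859-1 decoder accepts every byte value, so only the UTF-8 side can ever be "illegal".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingFallback {
    // UTF-8 encoding of U+FEFF, used here as the "new format" marker (an assumption of this sketch).
    private static final byte[] MARKER = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Plaintext bytes for a new value: marker followed by the UTF-8 bytes. */
    static byte[] encodeNew(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[MARKER.length + body.length];
        System.arraycopy(MARKER, 0, out, 0, MARKER.length);
        System.arraycopy(body, 0, out, MARKER.length, body.length);
        return out;
    }

    /** Strict UTF-8 decode from offset; returns null on malformed input instead of inserting U+FFFD. */
    static String strictUtf8(byte[] bytes, int offset) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes, offset, bytes.length - offset))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Decodes decrypted bytes: marked values as UTF-8, everything else as legacy ISO-8859-1. */
    static String decode(byte[] decrypted) {
        boolean marked = decrypted.length >= MARKER.length
                && decrypted[0] == MARKER[0]
                && decrypted[1] == MARKER[1]
                && decrypted[2] == MARKER[2];
        if (marked) {
            String utf8 = strictUtf8(decrypted, MARKER.length);
            if (utf8 != null) {
                return utf8;
            }
        }
        // No marker, or marker followed by invalid UTF-8: treat as an old value.
        return new String(decrypted, StandardCharsets.ISO_8859_1);
    }
}
```

An old value that happens to start with ï»¿ and is also valid UTF-8 after those three bytes would still be misread, which is the remaining probability of error mentioned above.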

    If at all possible, I'd go for migrating the existing entries. Otherwise, you'll have to carry "old-format compatibility stuff" in your code base forever, and still can't absolutely guarantee correct behaviour.
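A one-off migration of a single stored value might be sketched as follows. The class name `LegacyMigration` and the `decrypt` / `encrypt` parameters are placeholders: they stand in for whatever AES routines the application already uses, since the cipher mode and key handling are not shown in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.function.UnaryOperator;

final class LegacyMigration {
    /**
     * Re-encodes one stored value from ISO-8859-1 to UTF-8.
     * decrypt and encrypt are hypothetical hooks for the existing AES code.
     */
    static String migrate(String oldBase64,
                          UnaryOperator<byte[]> decrypt,
                          UnaryOperator<byte[]> encrypt) {
        // Undo step 3 (Base64) and step 2 (AES) of the old pipeline.
        byte[] oldPlain = decrypt.apply(Base64.getDecoder().decode(oldBase64));
        // Undo step 1: the old bytes are known to be ISO-8859-1.
        String text = new String(oldPlain, StandardCharsets.ISO_8859_1);
        // Redo the pipeline with UTF-8.
        byte[] newPlain = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(encrypt.apply(newPlain));
    }
}
```

Each value must be migrated exactly once; running this over an already migrated value would decode its UTF-8 bytes as ISO-8859-1 and corrupt it, so the batch should track which rows are done.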