
Confirming the encoding of a file


I am outputting a file from SSIS in UTF-8 encoding. This file is passed to a third party for import into their system, and they are having a problem importing it. Although they requested UTF-8 encoding, it seems they convert the file to ISO-8859-1. They use this command to convert the file's encoding:

iconv -f UTF-8 -t ISO-8859-1 dweyr.inp 

They are receiving this error:

illegal input sequence at position 11 

The piece of text causing the issue is:

ark O’Dwy

I think it's the apostrophe, or whatever version of an apostrophe is used in this text. The problem I face is that every text editor I try tells me the file is UTF-8 and renders it correctly. The vendor is saying that this character is not UTF-8.

How can I confirm who is correct?


Solution

  • The error message by iconv is a bit misleading, but kind-of correct.

    It doesn't tell you that the input isn't valid UTF-8, but that it cannot be converted to ISO-8859-1 in a lossless way. The character in question is ’ (U+2019, RIGHT SINGLE QUOTATION MARK), and ISO-8859-1 does not have a way to encode it.

    Verify that by executing this command:

    echo "ark O’Dwy" | iconv -f UTF-8 -t UTF-7
    

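    This produces output that looks like "ark O+IBk-Dwy". The "+IBk-" part is how UTF-7 writes U+2019, so iconv read the character as valid UTF-8 and knew exactly which code point it was.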

    Here I'm outputting to UTF-7 (a very rarely used encoding that is useful for demonstration here, but little else).

    In other words: the input sequence is only "illegal" in the sense that it cannot be converted to ISO-8859-1, but it's a perfectly valid UTF-8 sequence.

    If the third party claims to support UTF-8, then they may do so only very superficially. They might support any text that can be encoded in ISO-8859-1 as long as it's encoded in UTF-8 (which is an extremely low level of "UTF-8 support").
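
To settle the question without relying on a text editor, the two claims can be checked separately: is the file valid UTF-8, and can it be represented in ISO-8859-1? The following is a minimal sketch, assuming a POSIX shell and an iconv that fails on unconvertible characters (as glibc's does). The temporary file is a hypothetical stand-in for dweyr.inp, and the variable names are only for illustration.

```shell
#!/bin/sh
# Hypothetical sample standing in for the real file (dweyr.inp).
sample=$(mktemp)
# \342\200\231 are the bytes E2 80 99, the UTF-8 encoding of U+2019.
printf 'ark O\342\200\231Dwy\n' > "$sample"

# 1. Is it valid UTF-8? A UTF-8 to UTF-8 conversion fails on malformed input.
if iconv -f UTF-8 -t UTF-8 "$sample" > /dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi

# 2. Can it be represented in ISO-8859-1? This is the vendor's conversion.
if iconv -f UTF-8 -t ISO-8859-1 "$sample" > /dev/null 2>&1; then
    latin1_ok=yes
else
    latin1_ok=no
fi

# 3. Dump the raw bytes as hex so the apostrophe can be inspected directly.
hex=$(od -An -tx1 "$sample" | tr -d ' \n')

echo "valid UTF-8:     $valid_utf8"
echo "fits ISO-8859-1: $latin1_ok"
echo "bytes:           $hex"

rm -f "$sample"
```

If the first check passes and the second fails, the file is fine and the vendor's target encoding is the limitation. Pointing the same two iconv calls at the real file instead of the sample gives evidence that can be sent to the vendor.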