Are there hidden encoding errors that I need to fix when going from Latin 1 to UTF-8?

Do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?

I'm swapping forum software, and the old forum database used Latin1 encoding. The new forum database uses UTF8 encoding for tables.

It looks like the importer script did a straight copy from one table to another without trying to fix any encoding issues.

I've been manually fixing the visible errors using a find-and-replace based on the conversion info listed here: http://www.i18nqa.com/debug/utf8-debug.html

The rest of the text looks fine and is completely readable.

My limited understanding is that UTF-8 is backwards compatible with ASCII and Latin1 is mostly ASCII, so it's only the edge cases that are different and need to be updated.

So do I still need to run a full latin1 to UTF 8 conversion on the text that looks completely fine?

I'd rather not, because I've changed some of the BB Code tags in a number of the fields after they were stored in UTF 8. I'm concerned that those updates will have put UTF8 characters in the middle of the Latin1 characters, and that running a full conversion on mixed character sets will just muck things up further.

Solution

Bytes 0x00..0x7F are plain ASCII and are identical in Latin 1 and UTF-8, so text made up only of those characters needs no conversion. Any characters from ISO 8859-1 (Latin 1) in the range 0x80..0xFF need to be recoded as 2 bytes in UTF-8. The first byte is 0xC2 for 0x80..0xBF and 0xC3 for 0xC0..0xFF. The second byte is derived from the original Latin 1 value by setting the two most significant bits to 1 and 0 respectively. For the characters 0x80..0xBF those bits are already 1 and 0, so the second byte is unchanged from Latin 1. If you were using 8859-15, a few conversions are more complex: the Euro symbol, for example, sits at 0xA4 there and maps to U+20AC, which takes 3 bytes in UTF-8 (0xE2 0x82 0xAC).
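The byte rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not something you would need in practice, since Python's built-in codecs do the same job. The function names latin1_byte_to_utf8 and latin1_to_utf8 are made up for this example.

```python
def latin1_byte_to_utf8(b: int) -> bytes:
    """Recode one ISO 8859-1 byte value (0..255) as UTF-8 bytes."""
    if b < 0x80:
        # ASCII range: identical in both encodings.
        return bytes([b])
    # Lead byte is 0xC2 for 0x80..0xBF and 0xC3 for 0xC0..0xFF.
    first = 0xC2 if b < 0xC0 else 0xC3
    # Second byte: keep the low six bits, force the top two bits to 1 and 0.
    second = 0x80 | (b & 0x3F)
    return bytes([first, second])


def latin1_to_utf8(data: bytes) -> bytes:
    """Apply the per-byte rule to a whole Latin 1 byte string."""
    return b"".join(latin1_byte_to_utf8(b) for b in data)


# "café" stored as Latin 1: the final byte 0xE9 is the only non-ASCII one.
sample = bytes([0x63, 0x61, 0x66, 0xE9])
converted = latin1_to_utf8(sample)

# Cross-check against Python's own codecs.
assert converted == sample.decode("latin-1").encode("utf-8")
```

Here 0xE9 becomes the two bytes 0xC3 0xA9, while the three ASCII bytes pass through untouched.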

There are plenty of tools to assist; iconv is one of them.
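Since some of your fields may already contain real UTF-8, one cautious approach is to convert a field only when its bytes are not already valid UTF-8. The following is a minimal Python sketch of that idea (Python rather than iconv, purely for illustration). The helper name to_utf8 and its legacy_encoding parameter are hypothetical, and it assumes you can read each field as raw bytes.

```python
def to_utf8(raw: bytes, legacy_encoding: str = "latin-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else recode it from the legacy encoding."""
    try:
        # Pure ASCII and already-converted text both pass this check.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the whole value is in the legacy encoding.
        return raw.decode(legacy_encoding).encode("utf-8")


fields = [
    b"plain ASCII text",   # no bytes above 0x7F: left alone
    b"caf\xe9",            # Latin 1 bytes: recoded
    b"caf\xc3\xa9",        # already UTF-8: left alone
]
for field in fields:
    print(to_utf8(field))
```

Two caveats. A field that genuinely mixes Latin 1 and UTF-8 bytes will fail the validity check and be treated as Latin 1 throughout, which mangles its UTF-8 parts, so such fields still need fixing by hand. And a short Latin 1 sequence can occasionally happen to be valid UTF-8, so it is worth spot-checking the results.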