Search code examples
mysqlutf-8character-encodingutf8mb4

Invalid UTF-8 in SQL dump


I have a MySQL dump, which is not valid UTF-8. Two questions:

  1. Could this be caused by some of the database using utf8mb3 aka MySQL's 'utf8'? It definitely does use this encoding.
  2. If so, how can I fix it, without having access to MySQL to import, change table types, and re-export? Can I use any encoding conversion tool?

Edit to add specific data which isn't valid UTF-8:

uconv -f utf8 a.sql -o /dev/null

Conversion to Unicode from codepage failed at input byte position XXX. Bytes: ed Error: Illegal character found

Here is a hex sample.

xxd -s {XXX-16} -l 30 a.sql
YYY: 6e2e 203c 2f70 3e20 cfa1 ecaf a6eb 9ea0  n. </p> ........    
     aabb aabb aabb aabb aaaa bbbb bbaa aaaa
YYZ: edb6 b0e1 aea5 ee9e a027 2c27 3230       .........','20
     ^^^^ ^^

Edit 2: Added more context above. Also looks like the problem sequence meets UTF-8 format, it just maps to U+1DDB0 which doesn't exist.


Solution

  • I "solved" this using:

    uconv --from-code utf8 --from-callback substitute --to-code utf8 a.sql -o a.sql.fixed
    

    Which simply substitutes a default character for any invalid sequences. man unconv shows several other options, like escaping or deleting invalid sequences.

    In my case there only appeared to be a very small number of errors, so I was more interested in being able to process the dump, than correctly identifying what had happened in these cases.

    Edit: By using --from-callback skip I could count the number of invalid sequences, by comparing the length of the input and output files.