I have a MySQL dump, which is not valid UTF-8. Two questions:
Edit to add specific data which isn't valid UTF-8:
uconv -f utf8 a.sql -o /dev/null
Conversion to Unicode from codepage failed at input byte position XXX. Bytes: ed Error: Illegal character found
Here is a hex sample.
xxd -s {XXX-16} -l 30 a.sql
YYY: 6e2e 203c 2f70 3e20 cfa1 ecaf a6eb 9ea0 n. </p> ........
aabb aabb aabb aabb aaaa bbbb bbaa aaaa
YYZ: edb6 b0e1 aea5 ee9e a027 2c27 3230 .........','20
^^^^ ^^
Edit 2: Added more context above. Also looks like the problem sequence meets UTF-8 format, it just maps to U+1DDB0 which doesn't exist.
I "solved" this using:
uconv --from-code utf8 --from-callback substitute --to-code utf8 a.sql -o a.sql.fixed
Which simply substitutes a default character for any invalid sequences. man unconv
shows several other options, like escaping or deleting invalid sequences.
In my case there only appeared to be a very small number of errors, so I was more interested in being able to process the dump, than correctly identifying what had happened in these cases.
Edit: By using --from-callback skip
I could count the number of invalid sequences, by comparing the length of the input and output files.