I'm pulling in some RSS feeds from YouTube that contain invalid UTF-8. I can create a similar Ruby string using:
bad_utf8 = "\u{61B36}"
bad_utf8.encoding # => #<Encoding:UTF-8>
bad_utf8.valid_encoding? # => true
Ruby thinks this is a valid UTF-8 encoding and I'm pretty sure it isn't.
When talking to MySQL I get an error like this:
require 'mysql2'
client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("use test")
bad_utf8 = "\u{61B36}"
client.query("INSERT INTO utf8 VALUES ('#{bad_utf8}')")
# Incorrect string value: '\xF1\xA1\xAC\xB6' for column 'string' at row 1 (Mysql2::Error)
How can I detect or fix up these invalid types of encodings before I send them off to MySQL?
MySQL possibly rejects the string because the code point doesn't lie in the Basic Multilingual Plane (BMP), which contains the only characters that MySQL allows in its "utf8" character set.
Newer versions of MySQL have another character set called "utf8mb4", which supports Unicode characters outside the BMP.
But you probably don't want to be using that. Consider your use-cases carefully. Few real human languages (if any) use characters outside the BMP.
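If you stay on the "utf8" character set, one way to detect or fix up such strings before sending them to MySQL is to look for code points above U+FFFF. The following is a minimal sketch, assuming the input is already valid UTF-8; the names NON_BMP, outside_bmp? and replace_non_bmp are hypothetical helpers made up for illustration, not part of Ruby or the mysql2 gem.

```ruby
# Matches any code point outside the Basic Multilingual Plane (above U+FFFF).
# NON_BMP is an illustrative name, not a Ruby or mysql2 constant.
NON_BMP = /[\u{10000}-\u{10FFFF}]/

# Returns true if the string contains at least one non-BMP character.
# Assumes str is valid UTF-8 (check str.valid_encoding? first if unsure).
def outside_bmp?(str)
  !(str =~ NON_BMP).nil?
end

# Returns a copy of the string with every non-BMP character replaced.
# The default replacement is U+FFFD, the Unicode replacement character.
def replace_non_bmp(str, replacement = "\uFFFD")
  str.gsub(NON_BMP, replacement)
end

bad_utf8 = "a\u{61B36}b"
outside_bmp?(bad_utf8)             # => true
safe = replace_non_bmp(bad_utf8)   # non-BMP character swapped for U+FFFD
```

Whether to replace, strip (pass "" as the replacement), or reject such input depends on your use case; the cleaned string should still be escaped or bound as a parameter before it goes into a query.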