I got the string ãƒ\u008F from an external database, and I want to convert it back to the Unicode character it represents. I know the database uses Windows-1252 encoding, so the actual bytes should be \xe3\x83\x8f, which is ハ in UTF-8.
Here is what I've tried so far:
"ãƒ\u008F".encode('windows-1252')
# => Encoding::UndefinedConversionError: U+008F to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252
"ãƒ\u008F".encode('windows-1252', undef: :replace)
# => "\xE3\x83?"
This is reasonable, since 0x8f is undefined in the Windows-1252 code page.
----------Windows-1252-----------
0 1 2 3 4 5 6 7 8 9 a b c d e f
2 ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~
8 € � ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ � Ž � <---right here!
9 � ‘ ’ “ ” • – — ˜ ™ š › œ � ž Ÿ
a ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
b ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
c À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
d Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
e à á â ã ä å æ ç è é ê ë ì í î ï
f ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
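The gaps in the table can also be checked from Ruby itself. This is only a rough sketch: it asks the transcoder to convert every high byte from Windows-1252 to UTF-8 and collects the ones it rejects. The variable name undefined_bytes is just illustrative.

```ruby
# Collect the bytes in 0x80..0xFF that Ruby's Windows-1252 converter
# refuses to map to a Unicode character.
undefined_bytes = (0x80..0xFF).select do |byte|
  begin
    [byte].pack('C').force_encoding('Windows-1252').encode('UTF-8')
    false
  rescue Encoding::UndefinedConversionError
    true
  end
end

puts undefined_bytes.map { |b| format('0x%02X', b) }.join(' ')
# => 0x81 0x8D 0x8F 0x90 0x9D
```

These five positions match the � cells in rows 8 and 9 above, and 0x8F is one of them.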
My question is, how can I encode while preserving the undefined character? Namely, how can I get
s = "ãƒ\u008F".some_magic_methods
# => "\xE3\x83\x8F"
s.force_encoding('utf-8')
# => "ハ"
I think I have a vague idea of what's going on here, but I'm having trouble formulating a proper explanation. Nevertheless, here's a solution that at least works for your one example:
str = "ãƒ\u008F"
str2 = str.chars.map {|c| c.encode('windows-1252').ord rescue c.ord }
.pack('C*').force_encoding('utf-8')
puts str2
# => ハ
Of course, this is pretty inefficient for large texts, but hopefully it'll help. If I have the wherewithal later on I'll come back and try to add a better explanation.
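For what it's worth, here is my best guess at why the snippet works, written out as a slightly stricter sketch. Each character is encoded to Windows-1252 when a mapping exists (ã becomes 0xE3, ƒ becomes 0x83). When there is no mapping, the character is assumed to be a C1 control (U+0080..U+009F) that came from decoding an undefined byte, so its code point is used as the byte value directly. The helper name undo_cp1252_mojibake is made up for this example, and the rule that anything else is an error is my assumption, not something from the question.

```ruby
# Hypothetical helper: rebuilds the original bytes of a string that was
# wrongly decoded as Windows-1252, then reinterprets them as UTF-8.
def undo_cp1252_mojibake(str)
  bytes = str.each_char.map do |c|
    begin
      # Normal case: the character has a Windows-1252 byte.
      c.encode('Windows-1252').ord
    rescue Encoding::UndefinedConversionError
      # Assumption: only C1 controls stand in for raw undefined bytes.
      # Anything else (e.g. a CJK character) is re-raised as a real error.
      raise unless c.ord.between?(0x80, 0x9F)
      c.ord
    end
  end
  bytes.pack('C*').force_encoding('UTF-8')
end

puts undo_cp1252_mojibake("\u00E3\u0192\u008F")
# => ハ
```

Unlike the inline `rescue` modifier, this version only catches the conversion error, so unrelated exceptions are not silently swallowed. The result is not guaranteed to be valid UTF-8 for arbitrary input, so calling valid_encoding? on it afterwards may be worthwhile.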