ruby encoding utf-8 windows-1252

Encode while preserving undefined characters


I got the string ãƒ\u008F from an external database, and I want to convert it back to the original Unicode character. I know the database uses windows-1252 encoding, so the underlying bytes should be \xe3\x83\x8f, which is a single character encoded in UTF-8.

Here is what I've tried so far:

"ãƒ\u008F".encode('windows-1252')
# => Encoding::UndefinedConversionError: U+008F to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252

"ãƒ\u008F".encode('windows-1252', undef: :replace)
# => "\xE3\x83?"

This is reasonable, since 0x8f is undefined in the windows-1252 code page:

----------Windows-1252-----------
  0 1 2 3 4 5 6 7 8 9 a b c d e f
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ 
8 € � ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ � Ž � <---right here!
9 � ‘ ’ “ ” • – — ˜ ™ š › œ � ž Ÿ
a   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯
b ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
c À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
d Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
e à á â ã ä å æ ç è é ê ë ì í î ï
f ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
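
As a quick check of the table above, here is a small sketch that asks Ruby which bytes in the 0x80..0x9F range it refuses to decode from windows-1252. The variable name `undefined_bytes` is only illustrative, and the result depends on the conversion table shipped with your Ruby build.

```ruby
# Probe each byte in 0x80..0x9F: tag it as Windows-1252 and try to convert
# it to UTF-8. Bytes with no mapping raise Encoding::UndefinedConversionError.
undefined_bytes = (0x80..0x9F).select do |byte|
  begin
    [byte].pack('C').force_encoding('Windows-1252').encode('UTF-8')
    false
  rescue Encoding::UndefinedConversionError
    true
  end
end

# Print the unmapped bytes in hex so they can be compared with the table.
puts undefined_bytes.map { |b| format('0x%02X', b) }.join(' ')
```

The bytes it reports should line up with the � cells in rows 8 and 9 of the table.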

My question is: how can I encode the string while preserving the undefined character? In other words, how can I get the following?

s = "ãƒ\u008F".some_magic_methods
# => "\xE3\x83\x8F"

s.force_encoding('utf-8')
# => "ハ"

Solution

  • I think I have a vague idea of what's going on here, but I'm having trouble formulating a proper explanation. Nevertheless, here's a solution that at least works for your one example:

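    str = "ãƒ\u008F"
    # Encode each character to its windows-1252 byte. If the character has no
    # windows-1252 mapping (e.g. U+008F), use its code point as the byte instead.
    str2 = str.chars.map {|c| c.encode('windows-1252').ord rescue c.ord }
             .pack('C*').force_encoding('utf-8')
    puts str2
    # => ハ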
    

    Of course, this is pretty inefficient for large texts, but hopefully it'll help. If I have the wherewithal later on I'll come back and try to add a better explanation.
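
For what it's worth, here is a slightly more defensive sketch of the same per-character idea. It assumes the unmappable characters are C1 control characters such as U+008F whose code point equals the original byte, which holds for the example in the question but is an assumption about how the database decoded the data. The method name `undo_cp1252_mojibake` is made up for illustration; it is not part of Ruby.

```ruby
# Hypothetical helper (the name is illustrative, not a Ruby built-in).
# Rebuilds the original bytes of a string that was wrongly decoded as
# Windows-1252, then reinterprets those bytes as UTF-8.
def undo_cp1252_mojibake(str)
  bytes = str.each_char.map do |char|
    begin
      # Normal case: the character has a Windows-1252 byte.
      char.encode('Windows-1252').ord
    rescue Encoding::UndefinedConversionError
      # Assumption: an unmappable character is a passed-through byte,
      # so its code point must fit in a single byte.
      code_point = char.ord
      if code_point > 0xFF
        raise ArgumentError, format('cannot map U+%04X to a single byte', code_point)
      end
      code_point
    end
  end
  bytes.pack('C*').force_encoding(Encoding::UTF_8)
end

fixed = undo_cp1252_mojibake("\u00E3\u0192\u008F") # same string as "ãƒ\u008F"
puts fixed
# => ハ
```

Unlike the one-liner, this only rescues `Encoding::UndefinedConversionError` and raises if a character cannot be a single byte, so unrelated errors are not silently swallowed. It is still one `encode` call per character, so the efficiency caveat above applies.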