ruby utf-8 character-encoding web-scraping utf-16

Adding a backslash to fix character encoding in a Ruby string


I'm sure this is very easy but I'm getting tied in a knot with all these backslashes.

I have some data that I'm scraping (politely) from a website. Occasionally a sentence comes to me looking something like this:

u00a362 000? you must be joking

Which should of course be '£62 000? you must be joking'. A short test in irb deciphered it.

ruby-1.9.2-p180 :001 > string = "u00a3"
  => "u00a3" 
ruby-1.9.2-p180 :002 > string = "\u00a3"
  => "£" 

Of course: add a backslash and it will be decoded. I created the following with the help of this question:

puts str.gsub('u00', '\\u00') 

which resulted in \u00a3 being output. This is all well and good, but I want it to be £ in the string itself. Just putsing it isn't enough.

It's no good doing gsub('u00a3', '£') as there will doubtless be other characters I'm missing.

Thanks for any help.


Solution

  • Warning: the following is not very pretty.

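    str = "u00a362 000? you must be joking"
    # Wrap each u00XX escape (XX = two hex digits) in markers, then split on
    # the markers so every escape becomes its own array element.
    split_unicode = str.gsub(/(u00[0-9a-fA-F]{2})/, "split_here\\1split_here").split(/split_here/)
    final = split_unicode.map do |elem|
      if elem =~ /\Au00[0-9a-fA-F]{2}\z/
        # Drop the "u00" prefix, parse the hex digits, and pack the code point as UTF-8.
        [elem.sub(/\Au00/, '').hex].pack("U*")
      else
        elem
      end
    end
    puts final.join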
    

    So the idea here is to find the u00xx values and parse their hex digits into integer code points. From there, the pack method can turn each code point into the right Unicode character.

    It can also be crunched into a horrible one-liner!

    puts str.gsub(/(u00[0-9a-fA-F]{2})/, "split_here\\1split_here").split(/split_here/).map { |elem| elem =~ /\Au00[0-9a-fA-F]{2}\z/ ? [elem.sub(/\Au00/, '').hex].pack("U*") : elem }.join
    

    There might be a better solution (I hope!), but this one works.
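
A shorter variant of the same idea, offered only as a sketch: gsub accepts a block, so each escape can be converted in place without the split-and-join step. The method name decode_u00_escapes is made up for illustration, and the code assumes every escape is exactly "u00" followed by two hex digits, as in the sample sentence. Any ordinary text that happens to match that pattern would be converted too.

```ruby
# Hypothetical helper; the name is not from the original answer.
# Assumes escapes always look like "u00" plus exactly two hex digits.
def decode_u00_escapes(str)
  str.gsub(/u00([0-9a-fA-F]{2})/) do
    # $1 holds the two hex digits. String#hex parses them into an integer
    # code point, and pack("U") encodes that code point as a UTF-8 string.
    [$1.hex].pack("U")
  end
end

fixed = decode_u00_escapes("u00a362 000? you must be joking")
puts fixed
```

Because the method returns a new string, the decoded character ends up in the string itself (here in fixed) rather than only in the printed output.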