Tags: ruby, encoding, utf-8, yaml, byte-order-mark

Does loading data from YAML in Ruby change the encoding/byte structure of the data?


I am trying to write a method that removes blacklisted characters, such as BOM characters, using their UTF-8 values. I got this working by adding a method to the String class with the following logic:

  def remove_blacklist_utf_chars
    self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "")
    self
  end

To make this reusable across applications, I moved the blacklist into a config in a YAML file. The structure looks something like this:

:blacklist_utf_chars:
  :zero_width_space: '"\u{200b}"'

(Edit) As suggested by Drenmi, I also tried the following, which didn't work either:

:blacklist_utf_chars:
  :zero_width_space: \u{200b}

The problem I am facing is that remove_blacklist_utf_chars does not work when the blacklisted characters are loaded from the YAML file, but it does work when I write them directly in the method instead of going through the YAML file.

So basically
self.force_encoding("UTF-8").gsub!("\u{200b}".force_encoding("UTF-8"), "") -- works.

but,

self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "") -- doesn't work.

I printed the value of config[:blacklist_utf_chars][:zero_width_space], and it is equal to "\u{200b}".

I got this idea by referring: https://stackoverflow.com/a/5011768/2362505.

Now I am not sure what exactly happens when the blacklist is loaded from the YAML file in Ruby code.

EDIT 2:

On further investigation, I observed that an extra \ is added when the hash is read from the YAML file. So,

puts config[:blacklist_utf_chars][:zero_width_space].dump

prints:

"\\u{200b}"

But then if I just define the yaml as:

:blacklist_utf_chars:
  :zero_width_space: 200b

and do,

ch = "\u{#{config[:blacklist_utf_chars][:zero_width_space]}}"
self.force_encoding("UTF-8").gsub!(ch.force_encoding("UTF-8"), "")

I get

/Users/harshsingh/dir/to/code/utils.rb:121: invalid Unicode escape (SyntaxError)
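
A note on this error: Ruby resolves \u{...} escapes while parsing the string literal, before #{...} interpolation runs, so the parser sees \u{#{ and rejects it. If the config stores only the hex code point, one possible workaround is to build the character at runtime. The following is a rough sketch, where hex_code is a hypothetical stand-in for the value read from the YAML file (the to_s guards against an all-digit value being loaded as an Integer):

```ruby
# Hypothetical stand-in for config[:blacklist_utf_chars][:zero_width_space]
hex_code = "200b"

# Parse the hex digits at runtime and turn the code point into a UTF-8 string.
# Integer(..., 16) raises ArgumentError if the value is not valid hex.
char = Integer(hex_code.to_s, 16).chr(Encoding::UTF_8)

cleaned = "foo\u200bbar".gsub(char, "")

puts char.codepoints.inspect
puts cleaned
```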

Solution

  • The "\u{200b}" syntax is used for escaping Unicode characters in Ruby source code. It won’t work inside YAML.

    The equivalent syntax in a YAML document is the similar "\u200b" (which also happens to be valid in Ruby). Note the lack of braces ({}). The double quotes are also required; otherwise the value is parsed as the literal characters \u200b.

    So your YAML file should look like this:

    :blacklist_utf_chars:
      :zero_width_space: "\u200b"
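
To see the difference, here is a minimal sketch that loads both the original and the corrected config and runs the replacement against each. It makes a few assumptions that are not in the original post: the YAML lives in inline strings rather than a file, the hypothetical helpers load_blacklisted_char and remove_blacklisted replace the String monkey patch, and Symbol keys are permitted explicitly because newer Psych versions reject them by default.

```ruby
require "yaml"

# Inline stand-ins for the config file. Single-quoted heredocs keep the
# backslashes literal, as they would appear on disk.
fixed_yaml = <<~'YAML'
  :blacklist_utf_chars:
    :zero_width_space: "\u200b"
YAML

original_yaml = <<~'YAML'
  :blacklist_utf_chars:
    :zero_width_space: '"\u{200b}"'
YAML

# Hypothetical helper: parse the YAML and pull out the configured string.
def load_blacklisted_char(yaml_text)
  config = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
  config[:blacklist_utf_chars][:zero_width_space]
end

# Hypothetical helper: non-mutating version of the removal logic.
def remove_blacklisted(text, blacklisted)
  text.gsub(blacklisted, "")
end

fixed    = load_blacklisted_char(fixed_yaml)
original = load_blacklisted_char(original_yaml)
input    = "foo\u200bbar"

cleaned_with_fixed    = remove_blacklisted(input, fixed)
cleaned_with_original = remove_blacklisted(input, original)

puts fixed.codepoints.inspect       # a single code point, U+200B
puts original.length                # ten literal characters, quotes included
puts cleaned_with_fixed == "foobar" # the zero-width space was removed
puts cleaned_with_original == input # nothing matched, so nothing changed
```

With the double-quoted "\u200b" form, YAML itself decodes the escape, so the loaded value is already the real character and no force_encoding call should be needed on it.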