ruby-on-rails, ruby, encoding, byte-order-mark

Ruby: Check for Byte Order Marker


In Rails, we read some text files as ISO-8859-1. Sometimes the files come in as UTF-8 with BOM. I am trying to determine whether a file is UTF-8 with BOM and, if so, re-read it as bom|UTF-8.

I tried the following, but it doesn't seem to compare correctly:

# file is saved as UTF-8 with BOM using Sublime Text 2

> string = File.read(file, encoding: 'ISO-8859-1')

# this doesn't work, although it is supposed to
> string.start_with?("\xef\xbb\xbf".force_encoding("UTF-8"))
> false

# it works if I try this
> string.start_with?('ï»¿')
> true

The goal is to re-read the file as UTF-8 with BOM when it starts with the Byte Order Marker, and I want to avoid the string.start_with?('ï»¿') approach.


Solution

  • string.start_with?("\u00ef\u00bb\u00bf")
    

    From Ruby official documentation:

    \xnn      hexadecimal bit pattern, where nn is 1-2 hexadecimal digits ([0-9a-fA-F])

    \unnnn  Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])

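    That said, to write a Unicode character in a string literal, one should use the \uXXXX notation. "\xef\xbb\xbf" spells out the three raw bytes of the BOM, which UTF-8 decodes as the single character U+FEFF, whereas "\u00ef\u00bb\u00bf" spells out the three characters ï, » and ¿ that those bytes turn into when the file is read as ISO-8859-1. The \u form is safe and can be used reliably.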
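
    The difference between the two escapes, and one way to apply the check to the stated goal, can be sketched as follows. This is a minimal sketch, not the asker's actual code: the helper names (read_latin1_as_utf8, utf8_bom?, read_text) are hypothetical, and it assumes the ISO-8859-1 read ends up transcoded to UTF-8, which appears to be what happens in the question's Rails console. The byte-level check is an alternative to start_with? that does not depend on how the string was decoded.

```ruby
require "tempfile"

# "\x.." escapes insert raw bytes; "\u...." escapes insert Unicode code points.
BOM_CHAR    = "\xEF\xBB\xBF".dup.force_encoding(Encoding::UTF_8) # one character, U+FEFF
LATIN_CHARS = "\u00ef\u00bb\u00bf"                                # three characters: ï » ¿

# Hypothetical helper: read as ISO-8859-1 and transcode to UTF-8
# (assumed to mirror what the question's Rails console does).
def read_latin1_as_utf8(path)
  File.read(path, encoding: "ISO-8859-1:UTF-8")
end

# Hypothetical helper: compare the first three raw bytes, independent of any encoding.
def utf8_bom?(path)
  File.binread(path, 3) == "\xEF\xBB\xBF".b
end

# Hypothetical helper: choose how to read the file based on the byte-level check.
def read_text(path)
  if utf8_bom?(path)
    File.read(path, encoding: "bom|utf-8") # strips the BOM while reading
  else
    read_latin1_as_utf8(path)
  end
end

# Usage: write a small file that starts with a UTF-8 BOM, then inspect it.
Tempfile.create("bom_demo") do |f|
  f.binmode
  f.write("\xEF\xBB\xBFhello".b)
  f.flush

  string = read_latin1_as_utf8(f.path)
  string.start_with?(LATIN_CHARS) # true: the BOM bytes were decoded as ï » ¿
  string.start_with?(BOM_CHAR)    # false: the string does not contain U+FEFF
  read_text(f.path)               # "hello", with the BOM removed
end
```

    Checking the raw bytes before decoding avoids reading the file with the wrong encoding first, at the cost of opening the file twice.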