ruby file unicode byte-order-mark

How to avoid tripping over UTF-8 BOM when reading files


I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and it now breaks my rake task.

I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby that handles this correctly whether or not a BOM is present?


Solution

  • With Ruby 1.9.2 or later, you can use the mode r:bom|utf-8:

    text_without_bom = nil # define the variable outside the block so it is still in scope afterwards
    File.open('file.txt', 'r:bom|utf-8') do |file|
      text_without_bom = file.read
    end
    

    or

    text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')
    

    or

    text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')
    

    This works whether or not the file actually contains a BOM.


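    You can also pass the encoding option to other methods:

    # the second positional argument of readlines is a line separator,
    # so the encoding has to be passed as a keyword option
    lines_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')
    

    (This returns an array containing all lines.)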

    Or with CSV:

    require 'csv'
    CSV.open(@filename, 'r:bom|utf-8'){|csv|
      csv.each{ |row| p row }
    }
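
To check the behaviour described above, here is a minimal sketch that writes the same content once with and once without a BOM and reads both back. The helper names read_feed and parse_feed, the sample data, and the temporary file names are made up for illustration; only the 'bom|utf-8' encoding comes from the answer.

```ruby
require 'csv'
require 'tmpdir'

BOM = "\uFEFF"                    # U+FEFF, stored in UTF-8 as the bytes EF BB BF
SAMPLE = "id,name\n1,Zo\u00EB\n"  # made-up feed content

# Hypothetical helper: read the whole file, dropping a leading BOM if present.
def read_feed(path)
  File.read(path, encoding: 'bom|utf-8')
end

# Hypothetical helper: parse the file as CSV with a header row,
# returning one hash per data row.
def parse_feed(path)
  CSV.open(path, 'r:bom|utf-8', headers: true) do |csv|
    csv.map(&:to_h)
  end
end

Dir.mktmpdir do |dir|
  with_bom = File.join(dir, 'with_bom.csv')
  without_bom = File.join(dir, 'without_bom.csv')
  File.binwrite(with_bom, BOM + SAMPLE)
  File.binwrite(without_bom, SAMPLE)

  # Both files should yield identical text once the BOM is skipped.
  puts read_feed(with_bom) == read_feed(without_bom)

  # For comparison: a plain UTF-8 read keeps U+FEFF as the first character.
  puts File.read(with_bom, encoding: 'utf-8').start_with?(BOM)

  # The first CSV header should be "id" rather than "\uFEFFid".
  p parse_feed(with_bom).first.keys
end
```

A stray BOM in a CSV feed usually shows up as a first header that looks right when printed but does not match the expected string, so comparing the header keys as done here is a quick way to confirm that the mode is being applied.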