Tags: ruby, file, byte-order-mark, endianness

Ruby strptime not working while reading file


I have the following code:

require 'date'

f = File.open(filepath)

f.each_with_index do |line, i|
    a, b = line.split("\t")
    d = DateTime.strptime(a, '%m/%d/%Y %I:%M %p')
    puts "#{a} --- #{b}"
    break unless i < 100
end

And I'm getting the following error:

c_reader.rb:10:in `strptime': invalid date (ArgumentError)
  from c_reader.rb:10:in `block in <main>'
  from c_reader.rb:6:in `each'
  from c_reader.rb:6:in `each_with_index'
  from c_reader.rb:6:in `<main>'

The file content:

1/30/2014 1:00 AM   1251.6  
1/30/2014 2:00 AM   1248  
1/30/2014 3:00 AM   1246.32  
1/30/2014 4:00 AM   1242.96  
1/30/2014 5:00 AM   1282.08  
1/30/2014 6:00 AM   1293.84  
1/30/2014 7:00 AM   1307.04  
1/30/2014 8:00 AM   1337.76  
1/30/2014 9:00 AM   1357.92  

If I type this into IRB, it works perfectly:

DateTime.strptime("1/30/2014 2:00 PM", '%m/%d/%Y %I:%M %p')

Can someone please tell me what's going on here?


Solution

  • Your example data didn't match what your code was trying to process, so I adjusted it for this test. I also changed some of the times to PM to show that AM/PM is being honored.

    With those tweaks to the data, your code works fine. strptime is returning valid DateTime objects.

    require 'date'
    
    [
      "1/30/2014 1:00 AM\t1251.6",
      "1/30/2014 2:00 AM\t1248",
      "1/30/2014 3:00 PM\t1246.32",
      "1/30/2014 4:00 PM\t1242.96",
    ].each do |line|
      a, b = line.split("\t")
      puts DateTime.strptime(a, '%m/%d/%Y %I:%M %p')
    end
    # >> 2014-01-30T01:00:00+00:00
    # >> 2014-01-30T02:00:00+00:00
    # >> 2014-01-30T15:00:00+00:00
    # >> 2014-01-30T16:00:00+00:00
    

    Your data file starts with a BOM ("byte-order mark"). The BOM is the code point U+FEFF written at the very start of the file, and the order of its two bytes reveals the "endianness" of the file. In addition, each character here actually occupies two bytes. This is a UTF-16LE (little-endian) file because the first two bytes are ff fe, that is, the low-order byte 0xff comes before the high-order byte 0xfe. If they were fe ff it'd be big-endian:

    0000000: fffe 3100 2f00 3300 3000 2f00 3200 3000  ..1./.3.0./.2.0.
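
    If you don't have a hex dump tool handy, the same check can be sketched in Ruby. This is only an illustration: utf16_bom is a hypothetical helper name, and the sample file is made-up data written to a temp file, not your actual file.

```ruby
require 'tempfile'

# Hypothetical helper: map the first two raw bytes of a file to a UTF-16
# byte order. Returns nil when no UTF-16 BOM is present.
# Assumption: a leading FF FE could also begin a UTF-32LE BOM (FF FE 00 00);
# this sketch does not try to tell the two apart.
def utf16_bom(path)
  head = File.binread(path, 2) # raw bytes, no decoding
  case head
  when "\xFF\xFE".b then 'UTF-16LE'
  when "\xFE\xFF".b then 'UTF-16BE'
  end
end

# Build a small UTF-16LE sample (made-up data) that starts with a BOM.
sample = Tempfile.new(['bom_demo', '.csv'])
sample.binmode
sample.write("\uFEFF1/30/2014 1:00 AM\t1251.6\r\n".encode('UTF-16LE'))
sample.close

puts File.binread(sample.path, 4).unpack1('H*') # prints fffe3100
puts utf16_bom(sample.path)                     # prints UTF-16LE
```

    The first four bytes match the start of the dump above: the BOM, then "1" stored as 31 00.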
    

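    Ruby doesn't know what to do with those bytes because it's expecting its default encoding, typically UTF-8, so strptime sees the BOM and the extra zero bytes mixed in with the date text and rejects it. To fix that, you need to tell Ruby how to interpret the file. Look at the documentation for IO.new to see how to define encodings. The incoming data has to be converted from UTF-16LE to UTF-8: in the mode string below, BOM|UTF-16LE is the external (on-disk) encoding with the BOM stripped, and UTF-8 is the internal encoding your code sees. This is one way to do it: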

    require 'date'
    
    File.open(
      "test.csv",
      "rb:BOM|UTF-16LE:UTF-8"
    ) do |fi|
      fi.each_with_index do |line, i|
        a, b = line.split("\t")
        d = DateTime.strptime(a, '%m/%d/%Y %I:%M %p')
        puts "#{ 1 + i } #{a} --- #{b}"
        break unless i < 100
      end
    end
    

    Running that outputs:

    1 1/30/2014 1:00 AM --- 1251.6
    2 1/30/2014 2:00 AM --- 1248
    3 1/30/2014 3:00 AM --- 1246.32
    4 1/30/2014 4:00 AM --- 1242.96
    5 1/30/2014 5:00 AM --- 1282.08
    6 1/30/2014 6:00 AM --- 1293.84
    7 1/30/2014 7:00 AM --- 1307.04
    8 1/30/2014 8:00 AM --- 1337.76
    9 1/30/2014 9:00 AM --- 1357.92
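
    As an alternative to the mode string, the conversion can also be done by hand after reading the raw bytes. The following is a minimal sketch, not the only way to do it: read_utf16le_as_utf8 and parse_rows are hypothetical helper names, the sample file is made-up data, and it assumes the whole file fits in memory and really is UTF-16LE.

```ruby
require 'date'
require 'tempfile'

FORMAT = '%m/%d/%Y %I:%M %p'

# Hypothetical helper: read raw bytes, label them as UTF-16LE, transcode to
# UTF-8, then drop the BOM (U+FEFF) if one is present.
def read_utf16le_as_utf8(path)
  File.binread(path)
      .force_encoding(Encoding::UTF_16LE)
      .encode(Encoding::UTF_8)
      .delete_prefix("\uFEFF")
end

# Hypothetical helper: turn each tab-separated line into [DateTime, Float].
def parse_rows(path)
  read_utf16le_as_utf8(path).each_line.map do |line|
    stamp, value = line.chomp.split("\t") # chomp also removes a trailing "\r\n"
    [DateTime.strptime(stamp, FORMAT), Float(value)]
  end
end

# Made-up sample data, written as UTF-16LE with a BOM.
sample = Tempfile.new(['utf16_demo', '.csv'])
sample.binmode
sample.write(
  "\uFEFF1/30/2014 1:00 AM\t1251.6\r\n1/30/2014 2:00 PM\t1248\r\n".encode('UTF-16LE')
)
sample.close

parse_rows(sample.path).each { |time, value| puts "#{time} --- #{value}" }
# prints:
# 2014-01-30T01:00:00+00:00 --- 1251.6
# 2014-01-30T14:00:00+00:00 --- 1248.0
```

    The File.open mode string is usually the better choice for large files because it converts line by line instead of loading everything at once.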