Ruby `CSV.read` error invalid byte sequence in UTF-8 (ArgumentError)


First of all, this is not a duplicate of this SO question. I have a CSV file encoded in Shift-JIS, and this is my script to parse the file:

require 'csv'
str1 = '社員番号'
str2 = 'メールアドレス'
str1.force_encoding("Shift_JIS").encode!
str2.force_encoding("Shift_JIS").encode!
file=File.open("SyainInfo.csv", "r:Shift_JIS")
csv = CSV.read(file, headers: true)
p csv[str1]
p csv [str2]

But even after specifying the encoding, I am getting `invalid byte sequence in UTF-8 (ArgumentError)`. Any thoughts? My Ruby version is 2.3.0.


Solution

  • First of all, your encoding doesn't look right:

    '社員番号'.force_encoding("Shift_JIS").encode!
    #=> "\x{E7A4}\xBE\x{E593}\xA1\x{E795}\xAA\x{E58F}\xB7"
    

    force_encoding takes the bytes from str1 and interprets them as Shift JIS, whereas you probably want to convert the string to Shift JIS:

    '社員番号'.encode('Shift_JIS')
    #=> "\x{8ED0}\x{88F5}\x{94D4}\x{8D86}"
    

    Next, you can pass a filename to CSV.read, so instead of:

    file = File.open(filename)
    CSV.read(file)
    

    You can just write:

    CSV.read(filename)
    

    That said, you could either work with Shift JIS encoded strings:

    require 'csv'
    str1 = '社員番号'.encode("Shift_JIS")
    str2 = 'メールアドレス'.encode("Shift_JIS")
    csv = CSV.read('SyainInfo.csv', encoding: 'Shift_JIS', headers: true)
    csv[str1]
    csv[str2]
    

    Or – and that's what I would do – you could work with UTF-8 strings by specifying a second encoding:

    require 'csv'
    str1 = '社員番号'
    str2 = 'メールアドレス'
    csv = CSV.read('SyainInfo.csv', encoding: 'Shift_JIS:UTF-8', headers: true)
    csv[str1]
    csv[str2]
    

    encoding: 'Shift_JIS:UTF-8' instructs CSV to read Shift JIS data and transcode it to UTF-8. It's equivalent to passing 'r:Shift_JIS:UTF-8' to File.open.
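
To check the points above without the original file, here is a minimal sketch that builds a small Shift JIS fixture in a temp file and reads it back both ways. The helper names (read_sjis_csv, read_sjis_csv_via_file), the sample row, and the temp file are assumptions made for illustration; they stand in for SyainInfo.csv and are not part of the question.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: let CSV transcode Shift_JIS to UTF-8 while reading.
def read_sjis_csv(path)
  CSV.read(path, encoding: 'Shift_JIS:UTF-8', headers: true)
end

# Hypothetical helper: the same transcoding requested via File.open's mode string.
def read_sjis_csv_via_file(path)
  File.open(path, 'r:Shift_JIS:UTF-8') do |file|
    CSV.parse(file.read, headers: true)
  end
end

# Made-up sample data standing in for SyainInfo.csv (source file is UTF-8).
sample_utf8 = "社員番号,メールアドレス\n1001,taro@example.com\n"

# force_encoding only relabels the existing UTF-8 bytes; encode converts them.
relabelled = '社員番号'.dup.force_encoding('Shift_JIS')
converted  = '社員番号'.encode('Shift_JIS')
p relabelled.bytes == '社員番号'.bytes # => true (bytes unchanged)
p converted.bytesize                   # => 8 (two bytes per character)

Tempfile.create(['syain', '.csv']) do |tmp|
  tmp.close
  # Write the fixture as raw Shift_JIS bytes.
  File.binwrite(tmp.path, sample_utf8.encode('Shift_JIS'))

  table = read_sjis_csv(tmp.path)
  p table['社員番号']        # => ["1001"]
  p table['メールアドレス']  # => ["taro@example.com"]

  same = read_sjis_csv_via_file(tmp.path)
  p same['社員番号'] == table['社員番号'] # => true
end
```

Because both helpers hand back UTF-8 strings, the header lookups can use ordinary UTF-8 literals and no force_encoding call is needed anywhere.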