Tags: ruby, character-encoding, ruby-csv

Can't read file with charset utf-16le in Ruby except with puts


I need to read an external file in Ruby. Running file -i locally shows text/plain; charset=utf-16le.

I open it with Ruby's CSV using the separator '\t', and a row shows as: <CSV::Row "\xFF\xFEC\x00a\x00n\x00d\x00i\x00d\x00a\x00t\x00e\x00 \x00n\x00u\...

row.to_s produces \x000\x000\x000\x001\x00\t\x00E\x00D\x00O

Running puts row shows the data correctly: 0001 EDOARDO A... (the values also show legibly in vim and LibreOffice Calc).

Any suggestions on how to get at the data in Ruby? I've tried various combinations of opening the CSV with external_encoding: 'utf-16le', internal_encoding: 'utf-8', etc., but puts is the only thing that gives legible values.

The CSV object also reports its encoding as ASCII-8BIT: <#CSV io_type:StringIO encoding:ASCII-8BIT lineno:0 col_sep:"\\t" row_sep:"\n" quote_char:"\"" headers:true>
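For what it's worth, decoding the dumped bytes by hand confirms the data really is UTF-16LE. A minimal sketch, using a literal stand-in for the row bytes:

     require 'csv'

     # Stand-in for the raw bytes above; Ruby tags them ASCII-8BIT,
     # i.e. plain bytes with no encoding information.
     raw = "\xFF\xFE0\x000\x000\x001\x00\t\x00E\x00D\x00O\x00".b

     # Re-tag the bytes as UTF-16LE, then transcode to UTF-8 for CSV.
     utf8 = raw.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
     utf8.delete_prefix!("\uFEFF") # strip the byte-order mark

     p CSV.parse(utf8, col_sep: "\t").first # => ["0001", "EDO"]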

The file itself was produced as an XLS file. I have uploaded an edited version here (edited in gvim).


Solution

  • The issue was that I was reading from a Paperclip attachment, which needed to have the encoding set (overridden) before saving.

    Adding s3_headers in the model worked:

     has_attached_file :attachment, s3_headers: lambda { |attachment|
       # Serve the attachment with an explicit charset so the bytes
       # are tagged UTF-16LE when the file is read back.
       { 'Content-Type' => 'text/csv; charset=utf-16le' }
     }
    
    

    Thanks to Julien for tipping me off that the issue was related to the Paperclip attachment (his suggestion works when reading the file directly).
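
    For completeness, a minimal sketch of that direct read (the path is a placeholder): the bom|utf-16le external encoding consumes the \xFF\xFE byte-order mark, and utf-8 is the internal encoding each line is transcoded to.

     require 'csv'

     # "rb:bom|utf-16le:utf-8" = binary read, strip the BOM, treat the
     # stream as UTF-16LE, and hand each line to CSV as UTF-8.
     File.open('export.csv', 'rb:bom|utf-16le:utf-8') do |f|
       CSV.new(f, col_sep: "\t", headers: true).each do |row|
         puts row.fields.join(' | ')
       end
     end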