
Create file with UTF-16LE encoding


I am trying to create a tab-separated CSV file in Kotlin. Our requirements call for the generated file to be encoded as UTF-16LE.

My stripped down code is something like this:

import java.io.File
import java.io.FileOutputStream
import java.io.OutputStreamWriter

fun main(args: Array<String>) {
    val fileOutputStream = FileOutputStream(File("bla.csv"))
    val writer = OutputStreamWriter(fileOutputStream, Charsets.UTF_16LE)
    writer.write("bla\tbla\tbla")
    writer.write("\n")
    writer.write("lab\tlab\tlab")
    writer.flush()
    writer.close()
}

After running this program, I checked the generated file with the file command:

file -I bla.csv 
bla.csv: application/octet-stream; charset=binary

This is what I get when I use Charsets.UTF_16LE.

I have also tried the other UTF-16 variants, which left me even more confused!

So if I use Charsets.UTF_16 it will result in:

file -I bla.csv 
bla.csv: text/plain; charset=utf-16be

And if I use Charsets.UTF_16BE it will result in:

file -I bla.csv 
bla.csv: application/octet-stream; charset=binary

So, after a lot of self-doubt and being sure that I am doing something wrong, I have given up and come here.

Any guidance will be appreciated. Thanks in advance.


Solution

  • I suspect this is a limitation of the file command, and not a problem with your code (which is fine*).

    If you write a Byte Order Mark (\uFEFF) as the first character of the file, then file recognises it fine:

    > file bla.csv 
    bla.csv: Little-endian UTF-16 Unicode text
    > file -I bla.csv 
    bla.csv: text/plain; charset=utf-16le
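
    As a minimal sketch of the BOM approach (the helper name writeUtf16LeWithBom is made up for illustration and is not part of the question's code), the BOM can be written as an ordinary first character before the data:

    ```kotlin
    import java.io.File

    // Hypothetical helper: writes the given lines as UTF-16LE, preceded by a byte order mark.
    fun writeUtf16LeWithBom(file: File, lines: List<String>) {
        file.writer(Charsets.UTF_16LE).use { writer ->
            writer.write("\uFEFF") // BOM; encoded as the bytes FF FE in UTF-16LE
            writer.write(lines.joinToString("\n"))
        }
    }

    fun main() {
        val file = File("bla.csv")
        writeUtf16LeWithBom(file, listOf("bla\tbla\tbla", "lab\tlab\tlab"))
        // Print the first two bytes of the file in hex to confirm the BOM is there.
        println(file.readBytes().take(2).joinToString(" ") { "%02X".format(it) })
    }
    ```

    Whether the program that consumes the CSV expects or tolerates a BOM is a separate question, so it is worth checking that before adding one.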
    

    The file should be perfectly valid without a BOM, though, so I'm not sure why file isn't recognising it. It may be that it's not always possible to safely identify UTF-16LE without a BOM, though you'd think a case like this (where every other byte is 0) would be a safe bet!
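
    The three results in the question are consistent with how the JVM encoders treat the BOM. A small sketch (the hex helper is made up for illustration) that dumps the bytes each charset produces for the single character "a":

    ```kotlin
    import java.nio.charset.Charset

    // Hypothetical helper: encodes the text with the given charset and renders the bytes as hex.
    fun hex(text: String, charset: Charset): String =
        text.toByteArray(charset).joinToString(" ") { "%02X".format(it) }

    fun main() {
        println(hex("a", Charsets.UTF_16))   // FE FF 00 61 (big-endian, BOM written)
        println(hex("a", Charsets.UTF_16LE)) // 61 00 (no BOM)
        println(hex("a", Charsets.UTF_16BE)) // 00 61 (no BOM)
    }
    ```

    Only Charsets.UTF_16 writes a BOM on its own, and it writes it in big-endian order, which matches file reporting utf-16be for that variant and falling back to binary for the other two.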


    (* Well, there are always potential improvements…  For example, it'd be safer to wrap the output in a call to writer.use() instead of closing the file manually.  You could wrap the OutputStreamWriter in a BufferedWriter for efficiency.  And in production code, you'd want some error handling, of course.  But none of that's related to the question!)
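
    For what it's worth, a rough sketch of those improvements might look like the following. The writeTsv helper and its parameters are invented for illustration; it relies on the standard library's File.bufferedWriter and use, and leaves error handling to the caller:

    ```kotlin
    import java.io.File

    // Hypothetical helper: writes rows as tab-separated lines in UTF-16LE (no BOM).
    fun writeTsv(file: File, rows: List<List<String>>) {
        // bufferedWriter wraps the OutputStreamWriter in a BufferedWriter;
        // use closes it even if a write throws.
        file.bufferedWriter(Charsets.UTF_16LE).use { writer ->
            rows.forEachIndexed { index, row ->
                if (index > 0) writer.write("\n")
                writer.write(row.joinToString("\t"))
            }
        }
    }

    fun main() {
        writeTsv(
            File("bla.csv"),
            listOf(listOf("bla", "bla", "bla"), listOf("lab", "lab", "lab"))
        )
    }
    ```

    This produces the same content as the code in the question, just without the manual flush and close calls.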