I am trying to create a tab-separated CSV file in Kotlin. The requirement is that the produced file be encoded as UTF-16LE.
My stripped-down code looks like this:
    import java.io.File
    import java.io.FileOutputStream
    import java.io.OutputStreamWriter

    fun main(args: Array<String>) {
        val fileOutputStream = FileOutputStream(File("bla.csv"))
        val writer = OutputStreamWriter(fileOutputStream, Charsets.UTF_16LE)
        writer.write("bla\tbla\tbla")
        writer.write("\n")
        writer.write("lab\tlab\tlab")
        writer.flush()
        writer.close()
    }
After executing this program, I ran the file command on the generated file and got this:
    file -I bla.csv
    bla.csv: application/octet-stream; charset=binary
This is what I get when I use Charsets.UTF_16LE.
I have tried the other UTF-16 variants, which confused me even more.
If I use Charsets.UTF_16, the result is:
    file -I bla.csv
    bla.csv: text/plain; charset=utf-16be
And if I use Charsets.UTF_16BE, the result is:
    file -I bla.csv
    bla.csv: application/octet-stream; charset=binary
After a lot of self-doubt, and being sure that I am doing something wrong, I have given up and come here.
Any guidance will be appreciated. Thanks in advance.
I suspect this is a limitation of the file command, and not a problem with your code (which is fine*).
If you write a Byte Order Mark (\uFEFF) as the first character of the file, then the file command recognises it fine:
    > file bla.csv
    bla.csv: Little-endian UTF-16 Unicode text
    > file -I bla.csv
    bla.csv: text/plain; charset=utf-16le
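For reference, here is a minimal sketch of writing the BOM first. The helper name writeTsvWithBom and its rows parameter are just names I made up for illustration; the only real change from the question's code is the extra write of "\uFEFF" before any data.

```kotlin
import java.io.File

// Hypothetical helper: writes tab-separated rows as UTF-16LE, preceded by a BOM.
fun writeTsvWithBom(file: File, rows: List<List<String>>) {
    file.outputStream().writer(Charsets.UTF_16LE).use { writer ->
        // U+FEFF encoded as UTF-16LE becomes the bytes FF FE.
        writer.write("\uFEFF")
        writer.write(rows.joinToString("\n") { row -> row.joinToString("\t") })
    }
}

fun main() {
    val file = File("bla.csv")
    writeTsvWithBom(
        file,
        listOf(listOf("bla", "bla", "bla"), listOf("lab", "lab", "lab"))
    )
    // Show the first two bytes of the file, which should be the BOM.
    val firstTwo = file.readBytes().take(2)
        .joinToString(" ") { "%02x".format(it.toInt() and 0xFF) }
    println(firstTwo) // ff fe
}
```

Whether the consumer of the file expects (or tolerates) a BOM is something to check against your requirement; some tools treat it as part of the first field.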
The file should be perfectly valid without a BOM, though, so I'm not sure why the file command isn't recognising it. It may be that it's not always possible to safely identify UTF-16LE without a BOM, though you'd think a case like this (where every other byte is 0) would be a safe bet!
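To see what actually ends up in the file, you can dump the encoded bytes. This is only a small sketch (hex is a throwaway helper of mine), but it shows the difference between the three charsets: on the JVM, Charsets.UTF_16 encodes as big-endian and prepends a BOM, while the explicit UTF_16LE and UTF_16BE charsets write no BOM at all. That would explain why only the Charsets.UTF_16 output was identified, and why it was reported as utf-16be.

```kotlin
// Throwaway helper: renders a byte array as space-separated hex pairs.
fun hex(bytes: ByteArray): String =
    bytes.joinToString(" ") { "%02x".format(it.toInt() and 0xFF) }

fun main() {
    val sample = "bla\t"
    // No BOM, low byte first: every second byte is 0 for ASCII text.
    println(hex(sample.toByteArray(Charsets.UTF_16LE))) // 62 00 6c 00 61 00 09 00
    // No BOM, high byte first.
    println(hex(sample.toByteArray(Charsets.UTF_16BE))) // 00 62 00 6c 00 61 00 09
    // Big-endian with a leading BOM (FE FF).
    println(hex(sample.toByteArray(Charsets.UTF_16)))   // fe ff 00 62 00 6c 00 61 00 09
}
```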
(* Well, there are always potential improvements… For example, it'd be safer to wrap the output in a call to writer.use() instead of closing the file manually. You could wrap the OutputStreamWriter in a BufferedWriter for efficiency. And in production code, you'd want some error handling, of course. But none of that's related to the question!)
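A rough sketch of those two tweaks, in case it helps. The function name writeRows is my own, and File.bufferedWriter() is just one way of getting a BufferedWriter around an OutputStreamWriter; the output bytes are the same as with the question's code.

```kotlin
import java.io.File

// Hypothetical helper: writes each row on its own line as UTF-16LE (no BOM).
fun writeRows(file: File, rows: List<String>) {
    // bufferedWriter() wraps the stream writer in a BufferedWriter;
    // use { } flushes and closes it even if a write throws.
    file.bufferedWriter(Charsets.UTF_16LE).use { writer ->
        rows.forEachIndexed { index, row ->
            if (index > 0) writer.write("\n")
            writer.write(row)
        }
    }
}

fun main() {
    writeRows(File("bla.csv"), listOf("bla\tbla\tbla", "lab\tlab\tlab"))
}
```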