I download a CSV file and save it with this code:
body = HTTPoison.get!(url).body
|> String.replace("Ã¼", "ü")
|> String.replace("Ã¶", "ö")
File.write!("/tmp/example.csv", body)
Using String.replace/3 to replace Ã¼ with ü is of course not a good approach. HTTPoison reports the response header as {"Content-Type", "csv;charset=utf-8"}.
How can I solve this without String.replace/3?
What you have here is data that was first encoded as UTF-8; those bytes were then treated as latin1 and encoded as UTF-8 a second time.
A hex dump snippet from the data in that URL shows this:
00007d20: 2c22 222c 2c2c 224f 7269 6769 6e3a 2044 ,"",,,"Origin: D
00007d30: c383 c2bc 7373 656c 646f 7266 222c 224b ....sseldorf","K
00007d40: 6579 776f 7264 733a 204c 6173 7420 4d69 eywords: Last Mi
ü is encoded as <<0xc3, 0x83, 0xc2, 0xbc>>, which was probably created like this:
iex(1)> "ü\0"
<<195, 188, 0>>
iex(2)> <<195::utf8, 188::utf8>> == <<0xc3, 0x83, 0xc2, 0xbc>>
true
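The same double encoding can be reproduced on a whole word, which makes it easier to compare against the hex dump. This is only a sketch of how the data was probably produced; the variable names are illustrative, and the actual cause on the server is an assumption.

```elixir
# Correct UTF-8, as the data presumably looked before it was mangled.
original = "Düsseldorf"

# Misread every byte as one latin1 character, then encode the result as UTF-8.
double_encoded = :unicode.characters_to_binary(original, :latin1, :utf8)

# Print the raw bytes in hex so they can be compared with the dump above
# (44 c3 83 c2 bc 73 73 65 6c 64 6f 72 66).
IO.inspect(double_encoded, binaries: :as_binaries, base: :hex)
```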
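To reverse this process, you can use a combination of :unicode.characters_to_list and :erlang.list_to_binary. The first decodes the UTF-8 into a list of code points, and the second writes each code point (all below 256 here) back as a single byte.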
iex(3)> <<0xc3, 0x83, 0xc2, 0xbc>> |> :unicode.characters_to_list |> :erlang.list_to_binary
"ü"
The data at that URL also starts with a BOM:
00000000: efbb bf22 5a75 7069 6422 2c22 5072 6f67 ..."Zupid","Prog
          ^^^^ ^^
00000010: 7261 6d49 6422 2c22 4d65 7263 6861 6e74 ramId","Merchant
00000020: 5072 6f64 7563 744e 756d 6265 7222 2c22 ProductNumber","
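The BOM decodes to a single code point (U+FEFF), so it can be removed using |> Enum.drop(1) after :unicode.characters_to_list. It has to go before :erlang.list_to_binary, which only accepts values from 0 to 255.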
So the following should work for you:
HTTPoison.get!(url).body
|> :unicode.characters_to_list
|> Enum.drop(1)
|> :erlang.list_to_binary
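
The pipeline above assumes that the body always starts with a BOM and is always double encoded. If that might not hold, a more defensive variant could look like the following sketch. The module and function names (EncodingFix, repair/1) are hypothetical and not part of HTTPoison or the standard library; only the :unicode, :erlang, Enum and String calls are real.

```elixir
defmodule EncodingFix do
  # Hypothetical helper: undoes one round of latin1 -> UTF-8 double encoding.
  @bom 0xFEFF

  def repair(body) when is_binary(body) do
    case :unicode.characters_to_list(body) do
      # Drop the BOM only when it is really the first code point.
      [@bom | rest] -> to_bytes(rest, body)
      chars when is_list(chars) -> to_bytes(chars, body)
      # {:error, _, _} or {:incomplete, _, _}: the body was not valid UTF-8.
      _ -> body
    end
  end

  defp to_bytes(chars, fallback) do
    # Code points above 255 cannot have come from latin1 bytes,
    # and :erlang.list_to_binary/1 would raise on them.
    if Enum.all?(chars, &(&1 <= 0xFF)) do
      repaired = :erlang.list_to_binary(chars)
      # Keep the result only if it is valid UTF-8; otherwise the input
      # was most likely not double encoded in the first place.
      if String.valid?(repaired), do: repaired, else: fallback
    else
      fallback
    end
  end
end
```

It would be used in place of the pipeline, for example File.write!("/tmp/example.csv", EncodingFix.repair(HTTPoison.get!(url).body)). When the input does not look double encoded, it is returned unchanged, including any BOM.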