iconv for Elixir


I download a CSV file and save it with this code:

body = HTTPoison.get!(url).body
       |> String.replace("Ã¼", "ü")
       |> String.replace("Ã¶", "ö")
File.write!("/tmp/example.csv", body)

Using String.replace/3 to turn Ã¼ into ü is of course not a good approach. HTTPoison tells me that the response has the header {"Content-Type", "csv;charset=utf-8"}.

How can I solve this without String.replace/3?


Solution

  • The data is double-encoded: it was first encoded as UTF-8, then those bytes were treated as latin1 and encoded to UTF-8 a second time.

    A hex dump snippet from the data in that URL shows this:

    00007d20: 2c22 222c 2c2c 224f 7269 6769 6e3a 2044  ,"",,,"Origin: D
    00007d30: c383 c2bc 7373 656c 646f 7266 222c 224b  ....sseldorf","K
    00007d40: 6579 776f 7264 733a 204c 6173 7420 4d69  eywords: Last Mi
    

    ü is encoded as <<0xc3, 0x83, 0xc2, 0xbc>> which was probably created like this:

    iex(1)> "ü\0"
    <<195, 188, 0>>
    iex(2)> <<195::utf8, 188::utf8>> == <<0xc3, 0x83, 0xc2, 0xbc>>
    true
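
    The same effect can be reproduced on a whole word. This is only a sketch of how the server might have produced the data (the actual cause is an assumption). It reinterprets the UTF-8 bytes as latin1 and encodes them to UTF-8 again with :unicode.characters_to_binary/3. The variable names are made up for illustration.

    ```elixir
    # A correctly encoded UTF-8 string: "ü" is the two bytes 0xC3 0xBC.
    correct = "Düsseldorf"

    # Treat every byte as one latin1 character and encode the result as UTF-8.
    # Each of the two bytes of "ü" becomes its own two-byte sequence.
    double_encoded = :unicode.characters_to_binary(correct, :latin1, :utf8)

    # Print the raw bytes in hex so they can be compared with the hex dump.
    IO.inspect(double_encoded, binaries: :as_binaries, base: :hex)
    ```

    The bytes for "Dü" come out as 44 c3 83 c2 bc, which matches the dump above.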
    

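    To reverse this process, you can use a combination of :unicode.characters_to_list and :erlang.list_to_binary. The first decodes the UTF-8 into the code points [195, 188], and the second writes each code point back out as a single byte, which yields the original UTF-8 bytes.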

    iex(3)> <<0xc3, 0x83, 0xc2, 0xbc>> |> :unicode.characters_to_list |> :erlang.list_to_binary
    "ü"
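
    As an alternative sketch, the same reversal can probably be done in one call by asking :unicode.characters_to_binary/3 to decode UTF-8 and emit latin1. The snippet below checks both variants against each other. The variable names are illustrative only.

    ```elixir
    double_encoded = <<0xC3, 0x83, 0xC2, 0xBC>>

    # Variant from the answer: decode to code points, then write one byte per code point.
    via_list =
      double_encoded
      |> :unicode.characters_to_list()
      |> :erlang.list_to_binary()

    # Single-call variant: decode as UTF-8, encode as latin1 (one byte per code point).
    via_latin1 = :unicode.characters_to_binary(double_encoded, :utf8, :latin1)

    IO.inspect({via_list, via_latin1, String.valid?(via_latin1)})
    ```

    Note that the single-call variant returns an {:error, _, _} tuple instead of raising when the input contains a code point above 255.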
    

    The response from that URL also starts with a UTF-8 BOM (the bytes ef bb bf):

    00000000: efbb bf22 5a75 7069 6422 2c22 5072 6f67  ..."Zupid","Prog
              ^^^^ ^^
    00000010: 7261 6d49 6422 2c22 4d65 7263 6861 6e74  ramId","Merchant
    00000020: 5072 6f64 7563 744e 756d 6265 7222 2c22  ProductNumber","
    

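    The BOM decodes to the single code point 0xFEFF, which :erlang.list_to_binary cannot write as one byte, so it has to go. It can be removed using |> Enum.drop(1) after :unicode.characters_to_list.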

    So the following should work for you:

    HTTPoison.get!(url).body
    |> :unicode.characters_to_list()
    |> Enum.drop(1)
    |> :erlang.list_to_binary()
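
    The pipeline above assumes that a BOM is always present and that every code point fits into one byte. A slightly more defensive variant could look like the following sketch. MojibakeFix and repair/1 are hypothetical names, not part of HTTPoison or the standard library, and the error atom is likewise made up.

    ```elixir
    defmodule MojibakeFix do
      # Hypothetical helper: undoes one round of UTF-8 -> latin1 -> UTF-8 double encoding.

      # Returns {:ok, repaired} or {:error, :not_double_encoded}.
      def repair(body) when is_binary(body) do
        body
        |> strip_bom()
        |> :unicode.characters_to_binary(:utf8, :latin1)
        |> case do
          repaired when is_binary(repaired) ->
            # The recovered bytes must themselves be valid UTF-8.
            if String.valid?(repaired),
              do: {:ok, repaired},
              else: {:error, :not_double_encoded}

          # {:error, _, _} or {:incomplete, _, _}: some code point did not fit in one byte
          # or the input was not valid UTF-8 to begin with.
          _other ->
            {:error, :not_double_encoded}
        end
      end

      # Drop a leading UTF-8 BOM only if it is actually there.
      defp strip_bom(<<0xEF, 0xBB, 0xBF, rest::binary>>), do: rest
      defp strip_bom(body), do: body
    end

    # Possible usage with the code from the question (url is assumed to be defined):
    #   {:ok, csv} = MojibakeFix.repair(HTTPoison.get!(url).body)
    #   File.write!("/tmp/example.csv", csv)
    ```

    Checking String.valid?/1 on the result keeps already correct UTF-8 input from being silently mangled, since a correctly encoded "ü" would come back as the lone byte 0xFC.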