ruby-on-rails elixir phoenix-framework iconv

Are Iconv.convert return values in wrong order?


I have a Phoenix/Elixir app and need my strings to contain only ASCII characters. From what I have tried and found here, this can only be done properly with iconv.

:iconv.convert "utf-8", "ascii//translit", "árboles más grandes"
# arboles mas grandes

But when I run it on my Mac, it returns:

# 'arboles m'as grandes

It seems to return multiple characters for any character that takes more than one byte, and the order is reversed: the accent comes before the letter.

for example:

  • ä will turn to \"a
  • á will turn to 'a
  • ß will turn to ss
  • ñ will turn to ~n

I'm running it with IEx 1.2.5 on Mac.

Is there any way around this, or generally a better way to achieve the same functionality as rails transliterate?

EDIT:

So here is the updated Rails-like behaviour, based on the accepted answer by Henrik N. It does the same thing as Rails' parameterize (it turns any string into something you can use as part of a URL).

defmodule RailsLikeHelpers do
  require Inflex

  # Replace accented chars with their ASCII equivalents.
  # Decompose first (:nfd) so iconv drops the combining accents
  # instead of emitting things like 'a.
  def transliterate_string(abc) do
    :iconv.convert("utf-8", "ascii//translit", String.normalize(abc, :nfd))
  end

  def parameterize_string(abc) do
    parameterize_string(abc, "_")
  end

  def parameterize_string(abc, separator) do
    abc
    |> String.strip() # String.trim/1 on Elixir 1.3 and later
    |> transliterate_string()
    |> Inflex.parameterize(separator) # turns "Your Momma" into "your_momma"
    |> String.replace(~r/(?:#{Regex.escape(separator)}){2,}/, separator) # No more than one separator in a row.
  end
end

Solution

  • Running it through Unicode decomposition (as people kind of mentioned in the forum thread you linked to) seems to do it on my OS X:

    iex> :iconv.convert "utf-8", "ascii//translit", String.normalize("árboles más grandes", :nfd)
    "arboles mas grandes"
    

    Decomposition normalizes the string so that e.g. "á" is represented as two Unicode codepoints ("a" followed by a combining accent), as opposed to the composed form where it's a single codepoint. So I guess iconv's ASCII transliteration removes standalone accents/diacritics, but converts composed characters to things like 'a.
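
To make the decomposition step concrete, here is a minimal sketch that shows the codepoint counts before and after NFD normalization, plus a pure-Elixir way to drop the combining marks without iconv. The module `DecompositionSketch` and its `strip_marks/1` function are hypothetical names, not part of the answer above, and the regex approach is an assumption about what may be good enough for accent stripping. It does not transliterate characters that have no decomposition, such as ß.

```elixir
defmodule DecompositionSketch do
  # Hypothetical helper: decompose to NFD, then remove every
  # non-spacing mark (Unicode category Mn), e.g. the combining acute accent.
  def strip_marks(string) do
    string
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
  end
end

composed = "\u00E1"                             # "á" as a single codepoint
decomposed = String.normalize(composed, :nfd)   # "a" followed by U+0301

IO.inspect(length(String.codepoints(composed)))    # 1
IO.inspect(length(String.codepoints(decomposed)))  # 2
IO.inspect(DecompositionSketch.strip_marks("árboles más grandes"))
```

Because the accent is a separate codepoint after decomposition, it can be removed on its own, which matches the guess above about why iconv's transliteration behaves differently on decomposed input.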