c# .net unicode diacritics polish

Removing diacritics in Polish


I'm trying to remove diacritics from a Polish pangram. I'm using the code from Michael Kaplan's blog (http://www.siao2.com/2007/05/14/2629747.aspx), but without success.

Consider the following pangram: "Pchnąć w tę łódź jeża lub ośm skrzyń fig.". Everything works fine except for the letter "ł": it comes back unchanged as "ł". I guess the problem is that "ł" is represented as a single Unicode character, so there is no NonSpacingMark following it to strip.
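That guess can be checked directly. The following is a minimal sketch, not code from the blog post: the class and method names (`DecompositionCheck`, `Describe`) are made up for illustration. It decomposes a string with `NormalizationForm.FormD` and lists the Unicode category of each resulting UTF-16 code unit.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DecompositionCheck
{
    // Hypothetical helper: shows what a string looks like after canonical
    // decomposition (FormD), one "U+XXXX(Category)" entry per code unit.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append($"U+{(int)c:X4}({CharUnicodeInfo.GetUnicodeCategory(c)})");
        }
        return sb.ToString();
    }

    static void Main()
    {
        // "ą" splits into a base letter plus a combining ogonek.
        Console.WriteLine(Describe("\u0105")); // U+0061(LowercaseLetter) U+0328(NonSpacingMark)
        // "ł" has no canonical decomposition, so it stays a single letter.
        Console.WriteLine(Describe("\u0142")); // U+0142(LowercaseLetter)
    }
}
```

Since "ł" never produces a NonSpacingMark, a filter on that category has nothing to remove.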

Do you have any idea how I can fix it? I'd rather not rely on a custom mapping in some dictionary; I'm looking for some kind of Unicode conversion.


Solution

  • The approach taken in the article is to decompose the string and then remove every character in the Mark, Nonspacing category. Since, as you correctly point out, "ł" is not composed of two characters (one of which is Mark, Nonspacing), the behavior you see is expected.

    I don't think that the structure of Unicode allows you to accomplish a fully automated remapping (the author of the article you reference reaches the same conclusion).

    If you're just interested in Polish characters, at least the mapping is small and well-defined (see e.g. the bottom of http://www.biega.com/special-char.html). For the general case, I do not think an automated solution exists for characters that are not composed of a standard character plus a Mark, Nonspacing character.
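As a rough illustration of combining the two ideas above, here is a minimal sketch, assuming the decompose-and-filter approach plus a small explicit table for letters that have no canonical decomposition. The names `PolishText`, `RemoveDiacritics` and `Fallback` are hypothetical, and the table covers only "ł" and "Ł", which is the one Polish letter pair that the mark-stripping step leaves untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class PolishText
{
    // Hypothetical fallback table for letters without a canonical decomposition.
    // Extend it if other such letters matter for your input.
    static readonly Dictionary<char, char> Fallback = new Dictionary<char, char>
    {
        { '\u0142', 'l' }, // ł
        { '\u0141', 'L' }, // Ł
    };

    public static string RemoveDiacritics(string text)
    {
        // Step 1: canonical decomposition, e.g. "ą" becomes "a" + combining ogonek.
        string decomposed = text.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Step 2: drop the combining marks (Mark, Nonspacing).
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Step 3: map the few letters that normalization cannot split.
            sb.Append(Fallback.TryGetValue(c, out char plain) ? plain : c);
        }

        // Recompose anything that was decomposed but not stripped.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(PolishText.RemoveDiacritics("Pchnąć w tę łódź jeża lub ośm skrzyń fig."));
        // Pchnac w te lodz jeza lub osm skrzyn fig.
    }
}
```

This still relies on a hand-written mapping, just a very small one; the normalization step handles every letter that does decompose, so the table only needs the exceptions.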