c# .net tamil rune

C# Tamil Runes: How to get the correct number of Tamil letters


I'm trying to figure out how to handle filenames in Tamil. I need to shorten them like this: "foobar.gif" -> "foo...gif".

I've learned today that some languages use more than one char to represent a letter and I discovered that C# has the Rune concept.

I can't get this to work with Tamil.

Take "தமிழ்.gif" for example:

I had hoped that "தமிழ்.gif".Length would be 6, but it's 9.

How can I do a proper substring, so that "தமிழ்.gif".Substring(0, 2) gives "தமி" instead of "தம"?

What am I missing?


Solution

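  • This is the same kind of problem as surrogate pairs, which are pairs of char that represent a single Unicode character: one user-perceived letter can span more than one char. In Tamil the extra chars are combining marks (such as the vowel sign in "மி") rather than surrogate halves, and a Rune covers only one code point, so Rune alone does not solve it.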

    See these questions regarding surrogate pairs: "What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?" and "Is String.Replace(string,string) Unicode Safe in regards to Surrogate Pairs?"

    When dealing with characters that are actually longer than a single char, you'll have to find the indices at which each complete character starts within the string, and work with those instead of raw char indices.

    I should add that, because of this, you'll have to create some "Unicode-safe" methods for removing characters or finding indices. Otherwise you may end up removing "half" of a valid Unicode character and be left with invalid Unicode.
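
One possible way to build such "Unicode-safe" helpers is a minimal sketch on top of System.Globalization.StringInfo, which works in text elements (user-perceived characters) rather than chars. The class name TextElements and the helpers Count and Take are hypothetical names made up for this illustration; only StringInfo, LengthInTextElements and ParseCombiningCharacters come from .NET itself.

```csharp
using System;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: counts user-perceived characters (text elements).
    public static int Count(string s) => new StringInfo(s).LengthInTextElements;

    // Hypothetical helper: returns the first `count` text elements of `s`.
    public static string Take(string s, int count)
    {
        if (count <= 0) return string.Empty;

        // Start index (in chars) of every text element in `s`.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        if (count >= starts.Length) return s;

        // Cutting where the next text element starts never splits a letter.
        return s.Substring(0, starts[count]);
    }

    public static void Main()
    {
        string name = "தமிழ்.gif";
        Console.WriteLine(name.Length);   // number of chars
        Console.WriteLine(Count(name));   // number of text elements
        Console.WriteLine(Take(name, 2)); // first two text elements
    }
}
```

For "தமிழ்.gif" this should report 9 chars but 7 text elements (three Tamil letters plus ".gif"), and Take(name, 2) should give "தமி". Exact segmentation can differ between .NET versions, since newer runtimes follow the Unicode extended grapheme cluster rules, so treat those numbers as expectations to verify on your target runtime.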