c# unicode .net-2.0 tamil

TextElementEnumerator class bug or (Tamil) Unicode bug?


Why does TextElementEnumerator not parse this Tamil Unicode text properly?

using System;
using System.Collections.Generic;
using System.Globalization;

namespace Glyphtest
{
    internal class Program
    {
        private static void Main()
        {
            const string unicodetxt1 = "ஊரவர் கெளவை";
            List<string> output = Syllabify(unicodetxt1);
            Console.WriteLine(output.Count);
            const string unicodetxt2 = "கௌவை";
            output = Syllabify(unicodetxt2);
            Console.WriteLine(output.Count);
        }

        public static List<string> Syllabify(string unicodetext)
        {
            if (string.IsNullOrEmpty(unicodetext)) return null;
            TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(unicodetext);
            var data = new List<string>();
            while (enumerator.MoveNext())
                data.Add(enumerator.Current.ToString());
            return data;
        }
    }
}

The code sample above deals with two encodings of what looks like the same syllable:

'கௌ' -> 0x0B95 (க) + 0x0BCC (ௌ) (correct form)

'கெள' -> 0x0B95 (க) + 0x0BC6 (ெ) + 0x0BB3 (ள) (incorrect form)
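To check which of the two encodings a given string actually contains, it can help to print its code points. The following is only a minimal sketch; the names `CodePointDump` and `Describe` are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the original post): lists the UTF-16 code
// units of a string as U+XXXX so look-alike Tamil sequences can be told apart.
public static class CodePointDump
{
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            // Tamil letters are all in the BMP, so one char is one code point here.
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

With this helper, `CodePointDump.Describe("\u0b95\u0bcc")` returns `U+0B95 U+0BCC`, while `CodePointDump.Describe("\u0b95\u0bc6\u0bb3")` returns `U+0B95 U+0BC6 U+0BB3`, even though both strings look alike on screen.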

Is this a bug in the TextElementEnumerator class? Why does it not enumerate the string properly?

That is, கெளவை => 'கெள' + 'வை' is the enumeration I expect,

but கெளவை => 'கெ' + 'ள' + 'வை' is what I actually get.

If it is a bug, how can I work around it?
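The split described above can be reproduced in isolation. This sketch (the class name `TextElementDemo` is hypothetical) lists the text elements of a string. Because ள (U+0BB3) is a base letter rather than a combining mark, the enumerator starts a new text element when it reaches it.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not from the original post): returns the text
// elements of a string as a list, using the same enumerator as the question.
public static class TextElementDemo
{
    public static List<string> Elements(string text)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }
}
```

For the three-code-point spelling, `TextElementDemo.Elements("\u0b95\u0bc6\u0bb3\u0bb5\u0bc8")` contains 3 elements; for the spelling with U+0BCC, `TextElementDemo.Elements("\u0b95\u0bcc\u0bb5\u0bc8")` contains 2.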


Solution

  • This is not a bug in Unicode or in the TextElementEnumerator class. It is specific to how this Tamil text was typed.

    A Tamil letter is formed by a consonant followed by a vowel sign (a visual glyph).

    For example, the three code points க - \u0b95, ெ - \u0bc6, ள - \u0bb3

    render as 'கெள', which looks the same as the two-code-point sequence

    க - \u0b95, ௌ - \u0bcc

    Only the second sequence is the correct encoding of the syllable. In the first one ள is an ordinary base letter, so the enumerator correctly starts a new text element there. The irregular sequence therefore has to be replaced before the Tamil text is enumerated.

    According to a rule of Tamil grammar (ஔகாரக் குறுக்கம்), the vowel sign ௌ occurs on the first letter of a word, so only the start of each word needs to be checked.

    The code above should therefore be changed as follows:

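    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text.RegularExpressions;

    internal class Program
    {
        // A word-initial consonant followed by the vowel sign E (U+0BC6) and LLA (U+0BB3).
        // Anchored to the first letter, following the grammar rule quoted above.
        // The lookahead skips cases where LLA carries its own vowel sign or pulli
        // (as in வெள்ளை), because there it is a real consonant.
        private static readonly Regex IrregularAu =
            new Regex("^([\u0b95-\u0bb9])\u0bc6\u0bb3(?![\u0bbe-\u0bcd])");

        private static void Main()
        {
            const string unicodetxt1 = "ஊரவர் கெளவை";
            List<string> output = Syllabify(unicodetxt1);
            Console.WriteLine(output.Count);
            const string unicodetxt2 = "கௌவை";
            output = Syllabify(unicodetxt2);
            Console.WriteLine(output.Count);
        }

        public static string CheckVisualGlyphPattern(string txt)
        {
            // Splitting and re-joining collapses runs of whitespace into single spaces.
            string[] words = txt.Split(new[] { ' ', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < words.Length; i++)
            {
                // Replace consonant + U+0BC6 + U+0BB3 with consonant + U+0BCC.
                words[i] = IrregularAu.Replace(words[i], "$1\u0bcc");
            }
            return string.Join(" ", words);
        }

        public static List<string> Syllabify(string unicodetext)
        {
            // Check for null before preprocessing; Split would throw on a null string.
            if (string.IsNullOrEmpty(unicodetext)) return null;
            string processdata = CheckVisualGlyphPattern(unicodetext);
            if (processdata.Length == 0) return null;
            TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(processdata);
            var data = new List<string>();
            while (enumerator.MoveNext())
                data.Add(enumerator.Current.ToString());
            return data;
        }
    }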
    

    With this preprocessing step, the enumeration produces the appropriate visual glyphs: கெளவை => 'கௌ' + 'வை'.
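One might hope that Unicode normalization would merge the two encodings, so it is worth checking before relying on it. In the Unicode character data, U+0BCC decomposes canonically to U+0BC6 + U+0BD7 (the AU length mark), not to U+0BC6 + U+0BB3, so `string.Normalize` is not expected to rewrite the irregular sequence, and an explicit replacement like the one in the code above is still needed. A minimal sketch of such a check, with `NormalizationCheck` as a hypothetical class name:

```csharp
using System.Text;

// Hypothetical helper (not from the original post): reports whether
// NFC normalization would change a given string.
public static class NormalizationCheck
{
    // True when NFC normalization leaves the string unchanged.
    public static bool IsUnchangedByNfc(string text)
    {
        return text.Normalize(NormalizationForm.FormC) == text;
    }
}
```

Under these assumptions, `NormalizationCheck.IsUnchangedByNfc("\u0b95\u0bc6\u0bb3")` is true: normalization treats ெ + ள as two unrelated characters and leaves them alone.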