c# string encoding character non-english

Reading each non-English character from a file


Let's say a file contains non-English text. We can read its contents with the FileIO.ReadLinesAsync method, which returns each line as a string. How can I extract each letter (of the non-English alphabet) from such a string? The C# code below shows what I am trying to do.

    List<string> finalAlphabets = new List<string>();
    IList<string> alphabetLines = await FileIO.ReadLinesAsync(_languageFile, UnicodeEncoding.Utf8);
    if (alphabetLines.Count != 0)
    {
        foreach (string alphabetLine in alphabetLines)
        {
            // Say alphabetLine is "కాకికు". I want to extract each letter from it
            // and add it to the finalAlphabets list.
            // alphabetLine.Length is 6, but in Telugu this is a 3-letter word.
            finalAlphabets.Add("కా"); // How do I extract this letter from alphabetLine?
        }
    }

Solution

  • There is a set of text information classes, including TextInfo and StringInfo. In particular, you are likely looking for TextElementEnumerator, which lets you find "text element" boundaries.

    Simplified sample from the MSDN article:

    var myTEE = System.Globalization.StringInfo.GetTextElementEnumerator("కాకికు");
    while (myTEE.MoveNext())
    {
        Console.WriteLine("[{0}]:\t{1}\t{2}",
            myTEE.ElementIndex, myTEE.Current, myTEE.GetTextElement());
    }
    

    This produces the following output:

    [0]:  కా  కా
    [2]:  కి  కి
    [4]:  కు  కు
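
To connect this back to the question, here is a minimal sketch that collects the text elements of one line into a list, the way the question's finalAlphabets list is meant to be filled. The class name TextElementSplitter and the helper SplitIntoTextElements are hypothetical names introduced for illustration; only StringInfo.GetTextElementEnumerator and TextElementEnumerator come from the framework. The file-reading step is left out, so the sample string stands in for one line returned by FileIO.ReadLinesAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementSplitter
{
    // Hypothetical helper: returns each text element of 'line' as its own string.
    public static List<string> SplitIntoTextElements(string line)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(line);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns the current element as a string
            // (a base character plus any combining marks attached to it).
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // Stand-in for one line read from the file.
        List<string> finalAlphabets = SplitIntoTextElements("కాకికు");

        Console.WriteLine(finalAlphabets.Count);              // 3, not 6
        Console.WriteLine(string.Join(" | ", finalAlphabets)); // కా | కి | కు
    }
}
```

Inside the question's foreach loop, the same helper could be used as finalAlphabets.AddRange(SplitIntoTextElements(alphabetLine)). If only the count is needed, new StringInfo(alphabetLine).LengthInTextElements gives it without enumerating.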