Tags: c#, localization, right-to-left, bidi

How to detect whether a character belongs to a Right To Left language?


What is a good way to tell whether a string contains text in a right-to-left (RTL) language?

I have found this question which suggests the following approach:

public bool IsArabic(string strCompare)
{
  char[] chars = strCompare.ToCharArray();
  foreach (char ch in chars)
    if (ch >= '\u0627' && ch <= '\u0649') return true;
  return false;
}

While this may work for Arabic, it doesn't seem to cover other RTL languages such as Hebrew. Is there a generic way to know whether a particular character belongs to an RTL language?


Solution

  • Unicode characters have different properties associated with them. These properties cannot be derived from the code point; you need a table that tells you if a character has a certain property or not.

    You are interested in characters with bidirectional property "R" or "AL" (RandALCat).

    A RandALCat character is a character with unambiguously right-to-left directionality.

    Here's the complete list as of Unicode 3.2 (from RFC 3454):

    D. Bidirectional tables
    
    D.1 Characters with bidirectional property "R" or "AL"
    
    ----- Start Table D.1 -----
    05BE
    05C0
    05C3
    05D0-05EA
    05F0-05F4
    061B
    061F
    0621-063A
    0640-064A
    066D-066F
    0671-06D5
    06DD
    06E5-06E6
    06FA-06FE
    0700-070D
    0710
    0712-072C
    0780-07A5
    07B1
    200F
    FB1D
    FB1F-FB28
    FB2A-FB36
    FB38-FB3C
    FB3E
    FB40-FB41
    FB43-FB44
    FB46-FBB1
    FBD3-FD3D
    FD50-FD8F
    FD92-FDC7
    FDF0-FDFC
    FE70-FE74
    FE76-FEFC
    ----- End Table D.1 -----
    

    Here's some code to get the complete list as of Unicode 6.0:

    using System;
    using System.Globalization;
    using System.Linq;
    using System.Net;

    var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";

    // Each record is a semicolon-separated line: field 0 is the code point
    // in hex, field 4 is the bidirectional class.
    var query = from record in new WebClient().DownloadString(url).Split('\n')
                where !string.IsNullOrEmpty(record)
                let properties = record.Split(';')
                where properties[4] == "R" || properties[4] == "AL"
                select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);

    foreach (var codepoint in query)
    {
        Console.WriteLine(codepoint.ToString("X4"));
    }
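
    The query above prints one code point per line. To get output shaped like Table D.1, consecutive code points can be collapsed into ranges. This is only a minimal sketch: `RangeFormatter` and `ToRanges` are hypothetical names introduced for illustration, not part of .NET.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical helper, not part of .NET.
    static class RangeFormatter
    {
        // Collapses code points into "XXXX" or "XXXX-YYYY" entries,
        // the same shape as the entries in the RFC 3454 table.
        public static IEnumerable<string> ToRanges(IEnumerable<int> codepoints)
        {
            int? start = null;
            int end = 0;
            foreach (var cp in codepoints.Distinct().OrderBy(c => c))
            {
                if (start.HasValue && cp == end + 1)
                {
                    end = cp; // extends the current run
                    continue;
                }
                if (start.HasValue)
                {
                    yield return Format(start.Value, end);
                }
                start = end = cp; // starts a new run
            }
            if (start.HasValue)
            {
                yield return Format(start.Value, end);
            }
        }

        private static string Format(int first, int last)
        {
            return first == last
                ? first.ToString("X4")
                : first.ToString("X4") + "-" + last.ToString("X4");
        }
    }
    ```

    Passing `query` to `RangeFormatter.ToRanges` and writing each returned string to the console would then print one range per line.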
    

    Note that these values are Unicode code points. Strings in C#/.NET are UTF-16 encoded and need to be converted to Unicode code points first (see Char.ConvertToUtf32). Here's a method that checks if a string contains at least one RandALCat character:

    static bool IsAnyCharacterRightToLeft(string s)
    {
        // Advance by two UTF-16 code units when the current position holds a surrogate pair.
        for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
        {
            var codepoint = char.ConvertToUtf32(s, i);
            if (IsRandALCat(codepoint))
            {
                return true;
            }
        }
        return false;
    }
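
    The method above calls `IsRandALCat`, which is not defined here. One possible sketch is a binary search over the ranges of Table D.1. The class name `RandALCatTable` is hypothetical, and the data is the Unicode 3.2 list from RFC 3454, so right-to-left characters added in later Unicode versions are not covered.

    ```csharp
    using System;

    // Hypothetical container for the lookup; the name is an assumption.
    static class RandALCatTable
    {
        // Inclusive [first, last] ranges copied from Table D.1 of RFC 3454
        // (Unicode 3.2), sorted in ascending order.
        private static readonly int[,] Ranges =
        {
            {0x05BE, 0x05BE}, {0x05C0, 0x05C0}, {0x05C3, 0x05C3}, {0x05D0, 0x05EA},
            {0x05F0, 0x05F4}, {0x061B, 0x061B}, {0x061F, 0x061F}, {0x0621, 0x063A},
            {0x0640, 0x064A}, {0x066D, 0x066F}, {0x0671, 0x06D5}, {0x06DD, 0x06DD},
            {0x06E5, 0x06E6}, {0x06FA, 0x06FE}, {0x0700, 0x070D}, {0x0710, 0x0710},
            {0x0712, 0x072C}, {0x0780, 0x07A5}, {0x07B1, 0x07B1}, {0x200F, 0x200F},
            {0xFB1D, 0xFB1D}, {0xFB1F, 0xFB28}, {0xFB2A, 0xFB36}, {0xFB38, 0xFB3C},
            {0xFB3E, 0xFB3E}, {0xFB40, 0xFB41}, {0xFB43, 0xFB44}, {0xFB46, 0xFBB1},
            {0xFBD3, 0xFD3D}, {0xFD50, 0xFD8F}, {0xFD92, 0xFDC7}, {0xFDF0, 0xFDFC},
            {0xFE70, 0xFE74}, {0xFE76, 0xFEFC},
        };

        // Returns true if the code point falls inside one of the ranges.
        public static bool IsRandALCat(int codepoint)
        {
            int lo = 0, hi = Ranges.GetLength(0) - 1;
            while (lo <= hi)
            {
                int mid = (lo + hi) / 2;
                if (codepoint < Ranges[mid, 0])
                {
                    hi = mid - 1; // target lies below this range
                }
                else if (codepoint > Ranges[mid, 1])
                {
                    lo = mid + 1; // target lies above this range
                }
                else
                {
                    return true;
                }
            }
            return false;
        }
    }
    ```

    With this in place, `RandALCatTable.IsRandALCat(0x05D0)` (Hebrew alef) returns true and `RandALCatTable.IsRandALCat('A')` returns false. For current Unicode data, the ranges could be regenerated from UnicodeData.txt instead of being hard-coded.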