Tags: java, android, regex, arabic, urdu

How to know whether text is in Arabic or Urdu


I want to know whether a text contains any Urdu or Arabic letter. I am using the condition below, which produces false results when the text contains special characters. What is the right way to do this? Is there a library, or what is the right regex for it?

    if (cap.replaceAll("\\s+", "").matches("[A-Za-z]+")
            || cap.replaceAll("\\s+", "").matches("[A-Za-z0-9]+")) {
        Log.d("isUrdu", "false");
        caption.setTypeface(Typeface.DEFAULT);
        caption.setTextSize(16);
    } else {
        Log.d("isUrdu", "True");
        /* if (Build.VERSION.SDK_INT > Build.VERSION_CODES.JELLY_BEAN_MR1) { */
        caption.setTypeface(typeface);
        caption.setTextSize(20);
        /* } */
    }

Solution

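  • According to the Wikipedia article on the Urdu alphabet, Urdu text uses characters from the following Unicode ranges:

    U+0600 to U+06FF (Arabic)
    U+0750 to U+077F (Arabic Supplement)
    U+FB50 to U+FDFF (Arabic Presentation Forms-A)
    U+FE70 to U+FEFF (Arabic Presentation Forms-B)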
    

    To match a character from the basic Arabic block (U+0600 to U+06FF), you may use the \p{InArabic} Unicode block class.

    So, you may use

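    if (cap.matches("(?s).*[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF].*"))
    {
        /* There is a character from the ranges listed above (Urdu) */
    }
    else if (cap.matches("(?s).*\\p{InArabic}.*"))
    {
        /* The string contains an Arabic character.
           Caveat: \p{InArabic} is the block U+0600 to U+06FF, which the first
           test already covers, so this branch is not reached as written. */
    }
    else { /* Neither Arabic nor Urdu chars detected */ }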
    

    Note that (?s) enables the DOTALL modifier so that . can match line break characters, too.

    For better performance with matches, you may use negated character classes instead of the first .*: "(?s)[^\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]*[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF].*" and "(?s)\\P{InArabic}*\\p{InArabic}.*" respectively.

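    Note that you may also use the shorter "[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]" and "\\p{InArabic}" patterns with Matcher#find(), which looks for a match anywhere in the input instead of requiring the whole string to match.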
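
As a rough illustration of the Matcher#find() variant, here is a minimal sketch in plain Java (no Android classes). The class name ArabicScriptDetector and both method names are made up for this example; the character class is the one built from the ranges listed above. Because those ranges are shared by Arabic, Urdu and other languages written in the Arabic script, the check only reports that such a character is present, not which language the text is in.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of any library.
class ArabicScriptDetector {

    // Character class built from the ranges listed in the answer.
    // Compiled once and reused for every call.
    private static final Pattern ARABIC_SCRIPT = Pattern.compile(
            "[\\u0600-\\u06FF\\u0750-\\u077F\\uFB50-\\uFDFF\\uFE70-\\uFEFF]");

    // True if at least one character falls in the ranges above.
    // find() scans the whole input, so no leading .* or (?s) is needed.
    public static boolean containsArabicScript(String text) {
        return text != null && ARABIC_SCRIPT.matcher(text).find();
    }

    // The same idea written with matches() and the DOTALL flag, for comparison.
    // This one only looks at the basic Arabic block.
    public static boolean containsArabicBlockViaMatches(String text) {
        return text != null && text.matches("(?s).*\\p{InArabic}.*");
    }

    public static void main(String[] args) {
        String latin = "Hello, world! #1 @home";
        String mixed = "Caption:\n\u0627\u0631\u062F\u0648"; // second line is the word "Urdu" in Urdu

        System.out.println(containsArabicScript(latin));          // false
        System.out.println(containsArabicScript(mixed));          // true
        System.out.println(containsArabicBlockViaMatches(mixed)); // true
    }
}
```

In the code from the question, the result of containsArabicScript(cap) could decide which typeface and text size to set, so digits, punctuation and other special characters in an otherwise Latin caption no longer trigger the Urdu branch.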