regex powershell unicode hindi

How to detect whether a string contains Hindi (Devanagari) text, with character and word counts


Below is an example string:

$string = "abcde वायरस abcde"

I need to check whether this string contains any Hindi (Devanagari) content and, if so, get the count of Hindi characters and words. I guess a regex with a Unicode character class could work (http://www.regular-expressions.info/unicode.html), but I have not been able to figure out the correct regex.


Solution

  • To find out whether a string contains a Hindi (Devanagari) character, you need the full list of Devanagari characters. According to this website, they occupy the Unicode code points 0x0900 to 0x097F (decimal 2304 to 2431).

    The regular expression needs to match if the string contains any of those characters. You can therefore use a character set that looks like this:

    [\u0900\u0901\u0902 ... \u097D\u097E\u097F]

    Because writing this list out by hand is cumbersome, you can generate it by iterating over the code points from 2304 to 2431 (0x0900 to 0x097F).

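    To count all words that consist only of Hindi characters, you can use the following pattern. It requires the beginning of the string (^) or white-space (\s) before the word and white-space or the end of the string ($) after it, and it uses the global flag (/g) to match every occurrence. The trailing boundary is a lookahead, so the separating white-space is not consumed and two adjacent Hindi words are both matched:

    /(?:^|\s)[\u0900\u0901\u0902 ... \u097D\u097E\u097F]+(?=\s|$)/g

    Here is an implementation in JavaScript:

    // The Devanagari block spans U+0900 to U+097F (128 code points).
    var numberOfHindiCharacters = 128;
    var unicodeShift = 0x0900;
    var hindiAlphabet = [];
    for (var i = 0; i < numberOfHindiCharacters; i++) {
      // 0x0900..0x097F always yields three hex digits, so one leading zero pads to four.
      hindiAlphabet.push("\\u0" + (unicodeShift + i).toString(16));
    }

    // The lookahead (?=\s|$) checks the trailing boundary without consuming it,
    // so the next word can still match its leading white-space.
    var regex = new RegExp("(?:^|\\s)[" + hindiAlphabet.join("") + "]+(?=\\s|$)", "g");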
    var string1 = "abcde वायरस abcde";
    var string2 = "abcde abcde";
    
    [ string1.match(regex), string2.match(regex) ].forEach(function(match) {
      if(match) {
        console.log("String contains " + match.length + " words with Hindi characters only.");
      } else {
        console.log("String does NOT contain any words with Hindi characters only.");
      }
    });
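
The question also asks for a character count, which the snippet above does not produce. The following is a minimal sketch of one way to get detection, character count, and word count together, assuming the same U+0900 to U+097F block. It writes the set as a range ([\u0900-\u097F]) instead of listing every code point. The helper name countDevanagari and the shape of its return value are hypothetical, chosen for illustration only. Here a "word" is any white-space separated token with at least one Devanagari code point, and characters are counted per code point, so vowel signs are counted separately from the consonants they attach to.

```javascript
// Devanagari block written as a range; no "g" flag, so test() keeps no state between calls.
var DEVANAGARI = /[\u0900-\u097F]/;
// Same set with the global flag, used to collect every matching code point.
var DEVANAGARI_ALL = /[\u0900-\u097F]/g;

// Hypothetical helper (not part of the original answer).
function countDevanagari(text) {
  // Each code point in this block is one UTF-16 unit, so the number of
  // matches equals the number of Devanagari code points in the string.
  var characters = (text.match(DEVANAGARI_ALL) || []).length;

  // Split on white-space and keep tokens containing at least one Devanagari code point.
  var words = text.split(/\s+/).filter(function (token) {
    return DEVANAGARI.test(token);
  }).length;

  return { containsHindi: characters > 0, characters: characters, words: words };
}

// For the example string this reports containsHindi: true, characters: 5, words: 1.
console.log(countDevanagari("abcde वायरस abcde"));
```

Unlike the pattern above, this sketch also counts mixed tokens such as "abcवायरस" as one word. If only pure Hindi words should count, the filter can test the whole token against /^[\u0900-\u097F]+$/ instead.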