javascript regex unicode hindi xregexp

How to use whole word regex search for Devanagari text?


My HTML code with Devanagari words

<html>
<head>
<title>TODO</title>
<meta charset="UTF-8">
</head>
<body>
    मंत्री मुख्यमंत्री
    <script src="jquery-1.11.0.min.js"></script>
    <script src="xregexp_20.js"></script>
    <script src="addons/unicode/unicode-base.js"></script>
    <script src="addons/unicode/unicode-scripts.js"></script>
    <script src="my.js"></script>
</body>
</html>

My JavaScript code

var html = document.getElementsByTagName("html")[0];
var fullpage_content = html.innerHTML;

var regex = RegExp("मंत्री", "g");
var count = fullpage_content.match(regex);
console.log("count in page : " + count+ ", " + count.length);

// Use a \b word boundary; this fails for Devanagari because \b only recognizes ASCII word characters
regex = RegExp("\\bमंत्री\\b", "g");
count = fullpage_content.match(regex);
console.log("count in page : " + count);

regex = XRegExp("मंत्री");
var match = XRegExp.matchChain(fullpage_content, [regex]);
console.log("count in page : " + match + ", " + match.length);

// XRegExp does not make \b Unicode-aware either
regex = XRegExp("\\bमंत्री\\b");
match = XRegExp.matchChain(fullpage_content, [regex]);
console.log("count in page : " + match + ", " + match.length);

JavaScript output (in Chrome)

count in page : मंत्री,मंत्री, 2

count in page : null

count in page : मंत्री,मंत्री, 2

count in page : , 0

A whole-word search should find exactly one match (the standalone मंत्री, not the one inside मुख्यमंत्री), but both RegExp and XRegExp fail with \b. How can I match whole Devanagari words?
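
For reference, here is a minimal sketch of why \b behaves this way. In JavaScript, \b marks a position between a \w character ([A-Za-z0-9_]) and a non-\w character. Every Devanagari letter and combining mark is non-\w, so no boundary exists between a Devanagari word and a space. The variable names are illustrative only.

```javascript
// \b sits between a \w character ([A-Za-z0-9_]) and a non-\w character.
// Devanagari letters and marks are all non-\w, so there is no boundary
// between them and whitespace or punctuation.
const word = "मंत्री";
const text = "मंत्री मुख्यमंत्री";

const asciiBoundary = new RegExp("\\b" + word + "\\b", "g");

// No position around the Devanagari word counts as a boundary.
const noBoundaryMatches = text.match(asciiBoundary);
console.log(noBoundaryMatches); // null

// Next to ASCII word characters a boundary does exist, so \b "succeeds"
// exactly where a whole-word search should fail.
const besideAscii = ("a" + word + "a").match(asciiBoundary);
console.log(besideAscii.length); // 1
```

So \b is not merely unsupported for Devanagari; it reports boundaries in the wrong places, which is why an explicit character class is needed instead.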


Solution

  • regex = XRegExp("(?:^|[^\\p{Devanagari}\\p{L}])मंत्री(?=[^\\p{Devanagari}\\p{L}]|$)");
    

    This solved it. Thanks to Louis in particular. I tested it against a more rigorous test case before finalizing.

    मंत्री मंत्रीमंत्री मंत्रीमं ममंत्री मंत्री मंत्री मंत्री. .मंत्री मंत्री- <मंत्री मंत्री> मंत्री, ,मंत्री ,मंत्री, मंत्री,मंत्री, ,मंत्री,मंत्री,

    मंत्री, मंत्री

    मंत्री,मंत्री मंत्री मुख्यमंत्री
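
As a possible alternative without XRegExp, the same idea can be sketched with native Unicode property escapes and lookbehind. This assumes an engine with ES2018 support (the u flag, \p{...}, and (?<!...)), which was not available when the question was asked. The helper name countWholeWord is hypothetical, and treating "letter or combining mark" as the definition of a word character is an assumption that may need adjusting (for example, to include digits). Unlike a consuming prefix group, the lookbehind keeps the preceding delimiter out of the match.

```javascript
// Hypothetical helper: counts whole-word occurrences of `word` in `text`.
// Assumption: a "word character" is any Unicode letter (\p{L}) or
// combining mark (\p{M}); Devanagari vowel signs and the virama are marks.
function countWholeWord(text, word) {
  // Escape regex metacharacters in case the search word contains any.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    "(?<![\\p{L}\\p{M}])" + escaped + "(?![\\p{L}\\p{M}])",
    "gu"
  );
  const matches = text.match(pattern);
  return matches === null ? 0 : matches.length;
}

console.log(countWholeWord("मंत्री मुख्यमंत्री", "मंत्री")); // 1
```

Because both assertions are zero-width, adjacent occurrences separated by a single delimiter (such as मंत्री,मंत्री) are each counted.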