php regex hashtag bengali

Regular expression can't match the whole Bengali word


I'm trying to use a regular expression to match hashtags. When the hashtag is in English or Chinese, my code works fine. But when the hashtag is in Bengali, the pattern matches only the first part of the word.

Here is the code I'm testing with:

<?php

$hashtag = '#আয়াতুল_কুরসি';

preg_match('/(#\w+)/u', $hashtag, $matches);

print_r($matches);

?>

And the result is:

Array
(
    [0] => #আয়
    [1] => #আয়
)

I tried changing the pattern to '/(#\p{L}+)/u', but that didn't help.


Solution

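  • With the u modifier, \w matches letters, digits, and the underscore, but it does not match all of the combining marks (Unicode category M) that Bengali uses for vowel signs and other diacritics, so the match stops at the first such mark. \p{L} fails for the same reason, and it also excludes the underscore. Add \p{M} to the character class to allow them: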

    preg_match('/#[\w\p{M}]+/u', $hashtag, $matches);
    

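    To pull every hashtag out of a longer text, the same pattern can be used with preg_match_all. The following is only a minimal sketch: the helper name extractHashtags and the sample sentence are made up for illustration and are not part of the original question.

```php
<?php

// Hypothetical helper (not from the original post): returns all hashtags found in $text.
function extractHashtags(string $text): array
{
    // \w covers letters, digits and the underscore; \p{M} adds combining marks
    // such as Bengali vowel signs. The u modifier makes PCRE treat both the
    // pattern and the subject as UTF-8.
    if (preg_match_all('/#[\w\p{M}]+/u', $text, $matches) === false) {
        return []; // the regex engine reported an error (e.g. invalid UTF-8 input)
    }

    return $matches[0]; // full matches only
}

// Sample input, assumed for illustration.
$text = 'Reading #আয়াতুল_কুরসি and #php today';
print_r(extractHashtags($text));
```

    Each match ends at the first character that is neither a word character nor a combining mark, so the space after a hashtag terminates it.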