php regex pcre arabic-support positive-lookahead

Positive lookahead doesn't match Arabic text

Regex doesn't match Arabic text when using lookahead assertion

I am trying to split the text:

شكرا لك على المشاركة في هذه الدراسة. هذا الاستبيان يطلب معلومات عن:

stored in

$sentences = "شكرا لك على المشاركة في هذه الدراسة. هذا الاستبيان يطلب معلومات عن:";

with regex:

$pattern = "/(?<=\.)\s+(?=\p{IsArabic}+)/";

in function

preg_split($pattern, $sentences);

The regex doesn't match. It does match if I remove the lookahead assertion.

Why does that happen? What could be a workaround?

Solution

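You can fix it by using the \p{Arabic} Unicode script property (see the PCRE documentation for the list of supported script names) and adding the u modifier to the regex. PCRE does not recognize the \p{IsArabic} form; that Is-prefixed syntax comes from other regex flavors such as Java and .NET. The u modifier makes PCRE treat both the pattern and the subject as UTF-8, so \p{Arabic} is tested against whole code points rather than single bytes. Note that the + quantifier after \p{Arabic} is redundant: the lookahead only needs to confirm that one Arabic character follows.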

Use:

$sentences = "شكرا لك على المشاركة في هذه الدراسة. هذا الاستبيان يطلب معلومات عن:";
$pattern = "/(?<=\.)\s+(?=\p{Arabic})/u";
print_r(preg_split($pattern, $sentences));

Result:

Array
(
    [0] => شكرا لك على المشاركة في هذه الدراسة.
    [1] => هذا الاستبيان يطلب معلومات عن:
)
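
If you want to see why the original pattern produced nothing, the following is a minimal sketch that wraps the fixed pattern in a helper and checks the return value of preg_split(), which is false when the call fails. The function name splitBeforeArabic is hypothetical and only for illustration; it is not part of the original answer. The last lines run the original pattern for comparison, on the assumption that your PCRE build rejects \p{IsArabic}.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the original answer).
// Splits on whitespace that follows a period and precedes an Arabic character.
function splitBeforeArabic(string $text): array
{
    // u: treat pattern and subject as UTF-8 so \p{Arabic} sees whole code points.
    $pattern = '/(?<=\.)\s+(?=\p{Arabic})/u';

    $parts = preg_split($pattern, $text);
    if ($parts === false) {
        // preg_split() returns false on failure, e.g. when the pattern cannot be compiled.
        throw new RuntimeException('preg_split failed, error code ' . preg_last_error());
    }
    return $parts;
}

$sentences = "شكرا لك على المشاركة في هذه الدراسة. هذا الاستبيان يطلب معلومات عن:";
print_r(splitBeforeArabic($sentences));

// The original pattern, for comparison. The @ operator only hides the warning
// so that the boolean check below is the sole output of this step.
$original = @preg_split('/(?<=\.)\s+(?=\p{IsArabic}+)/', $sentences);
var_dump($original === false);
```

Checking for false explicitly helps here because a failed preg_split() call can otherwise look like a pattern that simply found no match.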