I have a string, for example:
cats, e.g. Barsik, are funny. And it is true. So,
And I want to get this as the result:
cats, e.g. Barsik, are funny.
My attempt:
mb_ereg_search_init($text, '((?!e\.g\.).)*\.[^\.]');
$match = mb_ereg_search_pos();
But it returns the position of the second dot (after the word "true").
How can I get the desired result?
Since a naive approach works for you, I am posting an answer. However, please note that detecting a sentence end is a very difficult task for a regex, and although it is possible to some degree, an NLP package should be used for that.
Having said that, I suggested using
'~(?<!\be\.g)\.(?=\s+\p{Lu})~ui'
The regex matches any dot (\.) that is not preceded by the whole word e.g (see the negative lookbehind (?<!\be\.g)), but that is followed by one or more whitespace characters (\s+) and then one uppercase Unicode letter (\p{Lu}).
See the regex demo
The case insensitive i modifier does not impact what \p{Lu} matches.
The u modifier is required since you are working with Unicode texts (like Russian).
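To see what the negative lookbehind contributes, here is a minimal sketch that runs the pattern with and without it on the sample string from the question and reports the byte offset of the first match through the PREG_OFFSET_CAPTURE flag of preg_match. The variable names $naive and $guarded are only illustrative. The i modifier is left out here so that the uppercase check does not depend on how the installed PCRE version treats \p{Lu} in case-insensitive mode.

```php
<?php
$text = "cats, e.g. Barsik, are funny. And it is true. So,";

// Without the lookbehind: the dot after "e.g" qualifies,
// because it is followed by whitespace and the uppercase "B".
preg_match('~\.(?=\s+\p{Lu})~u', $text, $naive, PREG_OFFSET_CAPTURE);

// With the lookbehind: the dot after the whole word "e.g" is rejected,
// so the first match is the dot after "funny".
preg_match('~(?<!\be\.g)\.(?=\s+\p{Lu})~u', $text, $guarded, PREG_OFFSET_CAPTURE);

echo $naive[0][1], "\n";   // 9
echo $guarded[0][1], "\n"; // 28
```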
To get the index of the first occurrence, use the preg_match function with the PREG_OFFSET_CAPTURE flag. Here is a slightly simplified version of the regex you supplied in the comments:
preg_match('~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu', $text, $match, PREG_OFFSET_CAPTURE);
Note that the negative lookbehinds are checked one after another at the same location in the string, so you do not have to additionally group them inside a positive lookahead. See the regex demo.
$re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';
$str = "cats, e.g. Barsik, are funny. And it is true. So,";
preg_match($re, $str, $match, PREG_OFFSET_CAPTURE);
echo $match[0][1]; // 28, the byte offset of the dot after "funny"
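Once the offset is known, the text asked for in the question can be cut out of the string. The following is only a sketch: firstSentence is a hypothetical helper name, and returning the input unchanged when nothing matches is an assumption about the desired fallback. Note that PREG_OFFSET_CAPTURE reports offsets in bytes even with the u modifier, so the byte-based substr is the matching function here rather than mb_substr.

```php
<?php
// Hypothetical helper: returns the text up to and including
// the first dot that the pattern treats as a sentence end.
function firstSentence(string $text): string
{
    $re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';

    if (preg_match($re, $text, $match, PREG_OFFSET_CAPTURE) !== 1) {
        // Assumed fallback: no sentence end found (or a regex error),
        // so hand back the input as is.
        return $text;
    }

    // $match[0][1] is a byte offset; +1 keeps the dot itself.
    return substr($text, 0, $match[0][1] + 1);
}

echo firstSentence("cats, e.g. Barsik, are funny. And it is true. So,");
// cats, e.g. Barsik, are funny.
```

The source file has to be saved as UTF-8 so that the Cyrillic abbreviations in the pattern are read correctly with the u modifier.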