php regex mbstring

How to cut a string from the start to the second-to-last dot of the string?


I have a string, for example:

cats, e.g. Barsik, are funny. And it is true. So,

And I want to get this as the result:

cats, e.g. Barsik, are funny.

My attempt:

mb_ereg_search_init($text, '((?!e\.g\.).)*\.[^\.]');
$match = mb_ereg_search_pos();

But it returns the position of the second dot (after the word "true").

How can I get the desired result?


Solution

  • Since a naive approach works for you, I am posting an answer. However, please note that detecting a sentence end is a very difficult task for a regex, and although it is possible to some degree, an NLP package should be used for that.

    Having said that, I suggested using

    '~(?<!\be\.g)\.(?=\s+\p{Lu})~ui'
    

    The regex matches any dot (\.) that is not preceded by the whole word e.g (see the negative lookbehind (?<!\be\.g)) and that is followed by one or more whitespace characters (\s+) and then an uppercase Unicode letter (\p{Lu}).

    See the regex demo

    The case insensitive i modifier does not impact what \p{Lu} matches.

    The u modifier is required since you are working with Unicode text (like Russian); the ~ characters are only the pattern delimiters.
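
As a rough illustration of what this pattern treats as a sentence end, the sketch below runs it over the sample string from the question and collects the byte offset of every matching dot. It assumes `preg_match_all` with `PREG_OFFSET_CAPTURE`, which is not part of the answer itself; the variable names are only illustrative.

```php
<?php
// Pattern from the answer: a dot not preceded by the word "e.g" and
// followed by whitespace plus an uppercase letter.
$pattern = '~(?<!\be\.g)\.(?=\s+\p{Lu})~ui';
$text = 'cats, e.g. Barsik, are funny. And it is true. So,';

// With PREG_OFFSET_CAPTURE every match becomes a [matched text, byte offset] pair.
preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

// Keep only the offsets of the dots that were accepted as sentence ends.
$offsets = array_column($matches[0], 1);
echo implode(', ', $offsets), PHP_EOL; // 28, 44
```

Neither dot of "e.g." is reported: the first one is followed by "g" rather than whitespace, and the second one is rejected by the lookbehind.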

    To get the index of the first occurrence, use the preg_match function with the PREG_OFFSET_CAPTURE flag. Here is a slightly simplified version of the regex you supplied in the comments:

    preg_match('~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu', $text, $match, PREG_OFFSET_CAPTURE);
    

    Note that the lookbehinds are evaluated one after another at the same location in the string, so you do not have to additionally group them inside a positive lookahead. See the regex demo.

    IDEONE demo:

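    $re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';
    $str = "cats, e.g. Barsik, are funny. And it is true. So,";
    // With PREG_OFFSET_CAPTURE, $match[0] is [matched text, byte offset]
    if (preg_match($re, $str, $match, PREG_OFFSET_CAPTURE)) {
        echo $match[0][1]; // 28, the byte offset of the dot after "funny"
    }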
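
The question asks for the cut string rather than the index, so here is a minimal sketch of that last step. The helper name `cutAtFirstSentenceEnd` is hypothetical and not from the answer. It relies on the fact that `PREG_OFFSET_CAPTURE` reports offsets in bytes, which is why it uses `substr` instead of `mb_substr`; the Russian sample sentence is also made up for illustration.

```php
<?php
/**
 * Hypothetical helper: returns $text up to and including the first dot that
 * the answer's pattern accepts as a sentence end, or $text unchanged when
 * no such dot is found.
 */
function cutAtFirstSentenceEnd(string $text): string
{
    $re = '~(?<!т\.н)(?<!т\.к)(?<!e\.g)\.(?=\s+\p{L})~iu';
    if (preg_match($re, $text, $match, PREG_OFFSET_CAPTURE) !== 1) {
        return $text; // no match (or a regex error): leave the text as it is
    }
    // The offset is counted in bytes, so substr() is the right tool here;
    // +1 keeps the dot itself, which is a single byte in UTF-8.
    return substr($text, 0, $match[0][1] + 1);
}

echo cutAtFirstSentenceEnd('cats, e.g. Barsik, are funny. And it is true. So,'), PHP_EOL;
echo cutAtFirstSentenceEnd('Кот, т.н. Барсик, смешной. И это правда.'), PHP_EOL;
```

Passing the byte offset to `mb_substr` would cut too late in text that contains multibyte characters before the dot, because `mb_substr` counts characters rather than bytes.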