I am trying to split raw text into sentences, so I simply use the preg_split() function and split the text wherever a ?, . or ; occurs. As expected, I ran into problems with special cases of the period, for example "Dr.", "Mr.", etc.
How can I exclude such words or patterns from splitting? Here is what I tried:
preg_split('/(\. )|(\? )|(\; )!(Mr\.)/', $content);
You can add a negative lookbehind to the regex to make sure that the dot is not preceded by "Mr" and company:
preg_split('/((?<!(Mr|Dr))\.|\?|;) /', $content);
I also simplified the regex a little bit. You should also consider replacing the single space at the end of the current expression with \s|$ (any whitespace or end of input), wrapped in a group as (\s|$), so that a delimiter at the very end of the input is matched too.
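Putting both suggestions together, a minimal sketch might look like the following. The function name splitSentences, the sample text, and the two-entry abbreviation list are only illustrative assumptions; extend the lookbehind with whatever abbreviations your text actually contains.

```php
<?php
// Hypothetical helper; the name and the abbreviation list are assumptions for illustration.
function splitSentences(string $content): array
{
    // Split on ".", "?" or ";" when followed by whitespace or end of input,
    // unless the "." comes directly after "Mr" or "Dr" (negative lookbehind).
    $pattern = '/(?:(?<!Mr|Dr)\.|\?|;)(?:\s+|$)/';

    // PREG_SPLIT_NO_EMPTY drops the empty piece left after a delimiter at the end.
    $parts = preg_split($pattern, $content, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure; fall back to an empty list.
    return $parts === false ? [] : $parts;
}

$content = 'Dr. Smith met Mr. Jones. Did they talk? Yes; briefly.';
print_r(splitSentences($content));
// Pieces: "Dr. Smith met Mr. Jones", "Did they talk", "Yes", "briefly"
```

Note that the delimiters themselves are consumed by the split, so the pieces come back without their closing punctuation. Also, the lookbehind only checks the two characters before the dot, so any word that merely ends in "Mr" or "Dr" is skipped as well.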