This is my current sentence sanitizing function:
# sanitize sentence
function sanitize_sentence($string) {
    $string = preg_replace("/(?<!\d)[.,!?](?!\d)/", '$0 ', $string); # word,word. > word, word.
    $string = preg_replace("/(^\s+)|(\s+$)/us", "", preg_replace('!\s+!', ' ', $string)); # " hello hello " > "hello hello"
    return $string;
}
Running some tests with this string:
$string = ' Helloooooo my frieeend!!!What are you doing?? Tell me what you like...........,please. ';
The result is:
echo sanitize_sentence($string);
Helloooooo my frieeend! ! ! What are you doing? ? Tell me what you like. . . . . . . . . . . , please.
As you can see, I have already met some of the requirements, but I'm still stuck on some details. The final result should be:
Helloo my frieend! What are you doing? Tell me what you like..., please.
This means that all of these requirements should be met:
I think regex is a very appropriate tool for this. It's sanitisation, after all, not grammar or syntax correction.
function sanitize_sentence($i) {
    $o = $i;
    // There can be only one or three consecutive periods . or ...
    // Four or more become a temporary ellipsis character (converted back below)
    $o = preg_replace('/\.{4,}/', '… ', $o);
    // Exactly two become one; the lookarounds leave a run of three untouched
    $o = preg_replace('/(?<!\.)\.{2}(?!\.)/', '. ', $o);
    // There can be only one consecutive ","
    $o = preg_replace('/,+/', ', ', $o);
    // There can be only one consecutive "!"
    $o = preg_replace('/!+/', '! ', $o);
    // There can be only one consecutive "?"
    $o = preg_replace('/\?+/', '? ', $o);
    // We just preemptively added a bunch of spaces.
    // Let's remove any spaces between punctuation marks we may have added
    $o = preg_replace('/([^\s\w])\s+([^\s\w])/', '$1$2', $o);
    // A letter cannot repeat itself more than 2 times in a word
    // (ASCII letters only; \w would also collapse digits, e.g. 1000 > 100)
    $o = preg_replace('/([a-zA-Z])\1{2,}/', '$1$1', $o);
    // Extra spaces should be eliminated
    $o = preg_replace('/\s+/', ' ', $o);
    $o = trim($o);
    // We want three literal periods, not an ellipsis char
    $o = str_replace('…', '...', $o);
    return $o;
}
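The same rules can also be written as an ordered table, since preg_replace() accepts arrays of patterns and replacements and applies each pair in sequence. The following is only a sketch under a few assumptions: the names sanitize_sentence_rules and $rules are made up for illustration, only ASCII letters are treated as letters, and long runs of periods are replaced with three literal periods directly instead of going through an ellipsis placeholder.

```php
<?php
// Hypothetical table-driven variant (name and structure are illustrative only).
// preg_replace() applies the pattern/replacement pairs in array order.
function sanitize_sentence_rules(string $input): string
{
    $rules = [
        '/\.{4,}/'                => '... ',  // four or more periods > three
        '/(?<!\.)\.{2}(?!\.)/'    => '. ',    // exactly two periods > one
        '/,+/'                    => ', ',    // one comma at a time
        '/!+/'                    => '! ',    // one "!" at a time
        '/\?+/'                   => '? ',    // one "?" at a time
        '/([^\s\w])\s+([^\s\w])/' => '$1$2',  // no spaces between punctuation marks
        '/([a-zA-Z])\1{2,}/'      => '$1$1',  // assumption: ASCII letters only
        '/\s+/'                   => ' ',     // collapse whitespace
    ];

    // preg_replace() returns null on a regex error; fall back to the input then.
    $output = preg_replace(array_keys($rules), array_values($rules), $input) ?? $input;

    return trim($output);
}

$string = ' Helloooooo my frieeend!!!What are you doing?? Tell me what you like...........,please. ';
echo sanitize_sentence_rules($string), "\n";
// Helloo my frieend! What are you doing? Tell me what you like..., please.
```

Keeping the rules in one array makes the order explicit, which matters here: the spacing rules have to run before the step that removes spaces between punctuation marks, and whitespace collapsing has to come last.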