Sanitizing sentence in PHP with preg_replace


This is my current sentence sanitizing function:

# sanitize sentence
function sanitize_sentence($string) {
    $string = preg_replace("/(?<!\d)[.,!?](?!\d)/", '$0 ', $string); # word,word. > word, word.
    $string = preg_replace("/(^\s+)|(\s+$)/us", "", preg_replace('!\s+!', ' ', $string)); # " hello    hello " > "hello hello"
    return $string;
}

Running some tests with this string:

$string = '     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. ';

The result is:

echo sanitize_sentence($string);  
Helloooooo my frieeend! ! ! What are you doing? ? Tell me what you like. . . . . . . . . . . , please.

As you can see, I have already met some of the requirements, but I'm still stuck on a few details. The final result should be:

Helloo my frieend! What are you doing? Tell me what you like..., please.

This means that all of the following requirements must be met:

  1. Periods may appear only singly or in a run of exactly three: . or ...
  2. There can be only one consecutive comma ,
  3. There can be only one consecutive question mark ?
  4. There can be only one consecutive exclamation mark !
  5. A letter cannot repeat itself more than 2 times in a row within a word. E.g.: mass (right), masss (wrong, and should be converted to mass)
  6. A space should always be added after these characters: .,!? (this is already working fine)
  7. In the case of 3 consecutive periods, the space is added only after the last period.
  8. Extra spaces (more than one space) should be collapsed, and whitespace should be trimmed from both ends of the sentence (this is already working fine)

Solution

  • I think regex is a very appropriate technology for this. It's sanitisation, after all, not grammar or syntax correction.

    function sanitize_sentence($i) {
    
        $o = $i;
    
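        //  There can be only one or three consecutive periods . or ...
        //  Three or more periods become a placeholder ellipsis character
        //  (with {4,} a run of exactly three would fall through to the next line and end up as "..")
        $o = preg_replace('/\.{3,}/','… ',$o);
        //  Exactly two periods become a single period
        $o = preg_replace('/\.{2}/','. ',$o);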
    
        //  There can be only one consecutive ","
        $o = preg_replace('/,+/',', ',$o);
    
        //  There can be only one consecutive "!"
        $o = preg_replace('/\!+/','! ',$o);
    
        //  There can be only one consecutive "?"
        $o = preg_replace('/\?+/','? ',$o);  
    
        //  we just preemptively added a bunch of spaces.
        //  Let's remove any spaces between punctuation marks we may have added
        $o = preg_replace('/([^\s\w])\s+([^\s\w])/', '$1$2', $o);
    
        //  A letter cannot repeat itself more than 2 times in a word
        //  (letters only: \w would also match digits and turn 1000 into 100)
        $o = preg_replace('/([a-z])\1{2,}/i','$1$1',$o);
    
        //  Extra spaces should be eliminated
        $o = preg_replace('/\s+/', ' ', $o);
        $o = trim($o);
    
        // we want three literal periods, not an ellipsis char
        $o = str_replace('…','...',$o);
    
        return $o;
    }
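
One gap worth noting: the function above never adds a space after a lone period, so `word.word` passes through unchanged even though requirement 6 asks for a space after `.,!?`. Below is a minimal sketch of one way to cover that, reusing the lookaround idea from the question's own pattern. The function name `sanitize_sentence_sketch` and the exact patterns are my assumptions for illustration, not part of the original answer. It also keeps literal periods throughout instead of going through a placeholder ellipsis character.

```php
<?php
// Hypothetical variant for illustration only; name and patterns are assumptions.
function sanitize_sentence_sketch(string $s): string
{
    // Three or more periods -> exactly three.
    $s = preg_replace('/\.{3,}/', '...', $s);
    // Exactly two periods (not part of a longer run) -> one.
    $s = preg_replace('/(?<!\.)\.{2}(?!\.)/', '.', $s);
    // A run of the same mark (, ! ?) -> a single mark.
    $s = preg_replace('/([,!?])\1+/', '$1', $s);
    // An ASCII letter (not a digit) may repeat at most twice in a row.
    $s = preg_replace('/([a-z])\1{2,}/i', '$1$1', $s);
    // One space after each run of punctuation. As in the question's pattern,
    // marks next to a digit (e.g. 3.14) are left alone.
    $s = preg_replace('/(?<!\d)[.,!?]+(?![\d.,!?])/', '$0 ', $s);
    // Collapse whitespace and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $s));
}

$sample = '     Helloooooo my frieeend!!!What are you doing??    Tell me what you like...........,please. ';
echo sanitize_sentence_sketch($sample), "\n";
// Helloo my frieend! What are you doing? Tell me what you like..., please.
```

Because the spacing step matches a whole run of punctuation at once, `...,` gets a single trailing space rather than one per character, which is what requirement 7 asks for. The digit lookarounds are inherited from the question, so a mark directly after a number (as in `2020.Next`) is still left without a space.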