Tags: php, regex, string, preg-split, text-segmentation

How can I split a sentence into words and punctuation marks?


For example, I want to split this sentence:

I am a sentence.

into an array with 5 parts: "I", "am", "a", "sentence", and ".".

I tried explode first and am now using preg_split, but neither gives me the result I want.

This is what I've tried:

$sentence = explode(" ", $sentence);
/*
returns array(4) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(9) "sentence."
}
*/

And also this:

$sentence = preg_split("/[.?!\s]/", $sentence);
/*
returns array(5) {
  [0]=>
  string(1) "I"
  [1]=>
  string(2) "am"
  [2]=>
  string(1) "a"
  [3]=>
  string(8) "sentence"
  [4]=>
  string(0) ""
}
*/

How can this be done?


Solution

  • You can split on word boundaries:

    $sentence = preg_split("/(?<=\w)\b\s*/", 'I am a sentence.');
    

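    The pattern matches at every position that is preceded by a word character (the lookbehind (?<=\w)) and sits on a word boundary (\b), that is, at the end of each word. Any whitespace that follows (\s*) is consumed as part of the delimiter, so it does not appear in the result. Because the match after "sentence" is zero-length, the period is kept as its own element.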

    Output:

    array(5) {
      [0]=>
      string(1) "I"
      [1]=>
      string(2) "am"
      [2]=>
      string(1) "a"
      [3]=>
      string(8) "sentence"
      [4]=>
      string(1) "."
    }
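
One caveat with the pattern above: it only splits at the end of a word, so punctuation that is followed by a space stays attached to the next word. For 'Hello, world!' it should give "Hello", ", world", "!". If every punctuation mark needs to be its own element, one alternative is to match the tokens instead of splitting on delimiters. The following is a minimal sketch, not part of the original answer; the function name split_words_and_punctuation is hypothetical, and it assumes that a "word" is a run of \w characters and that every other non-whitespace character counts as one punctuation token.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the answer):
// tokenize a sentence into words and individual punctuation marks.
function split_words_and_punctuation(string $sentence): array
{
    // \w+      a run of word characters (one whole word)
    // [^\w\s]  a single character that is neither a word character
    //          nor whitespace (one punctuation mark)
    // The u modifier treats pattern and subject as UTF-8.
    preg_match_all('/\w+|[^\w\s]/u', $sentence, $matches);

    // $matches[0] holds the full matches in order of appearance.
    return $matches[0];
}

print_r(split_words_and_punctuation('I am a sentence.'));
print_r(split_words_and_punctuation('Hello, world!'));
```

Under these assumptions the first call yields the same five parts as the answer above, and the second yields "Hello", ",", "world", "!". Note that contractions are split as well: "I'm" becomes "I", "'", "m", so the pattern would need adjusting if apostrophes should stay inside words.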