Search code examples
phpsplitcpu-word

Create array of words from a string of text


I would like to split a text into single words using PHP. Do you have any idea how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

Is this a good approach? Do you have any idea for improvement?

Thanks in advance!


Solution

  • Use the class \p{P} which matches any unicode punctuation character, combined with the \s whitespace class.

    $result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);
    

    This will split on a group of one or more whitespace characters, but also suck in any surrounding punctuation characters. It also matches punctuation characters at the beginning or end of the string. This discriminates cases such as "don't" and "he said 'ouch!'"