Isolate all words in a string and the number of (multibyte-safe) characters that precede each word

I want to use preg_split() with its PREG_SPLIT_OFFSET_CAPTURE option to capture both the word and the index where it begins in the original string.

However, my string contains multibyte characters, which throws off the offsets because they are counted in bytes rather than characters. There doesn't seem to be an mb_ equivalent of preg_split(). What are my options?

Example:

$text = "Hello world — goodbye";

$words = preg_split("/(\w+)/x",
                    $text,
                    -1,
                    PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_OFFSET_CAPTURE);

foreach($words as $word) {
    print("$word[0]: $word[1]<br>");
}

This outputs:

Hello: 0
: 5
world: 6
— : 11
goodbye: 16

Because the dash is an em-dash rather than a standard hyphen, it is a multibyte character (three bytes in UTF-8), so the offset of "goodbye" comes out as 16 instead of 14.
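The gap between 16 and 14 is the two extra bytes of the em-dash. One possible workaround, sketched here under the assumption that the text is valid UTF-8, is to keep preg_split() and convert each byte offset into a character offset by counting the characters in the bytes that precede it. The helper name preg_split_char_offsets() is hypothetical, not a built-in function.

```php
<?php
// Hypothetical helper: preg_split() with character offsets instead of byte
// offsets. Assumes $text is valid UTF-8 and the mbstring extension is loaded.
function preg_split_char_offsets(string $pattern, string $text): array {
    $parts = preg_split(
        $pattern,
        $text,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE
    );

    foreach ($parts as &$part) {
        // $part[1] is a byte offset. Take the bytes before it and count
        // how many characters they contain.
        $part[1] = mb_strlen(substr($text, 0, $part[1]), 'UTF-8');
    }
    unset($part); // Break the reference left over from the loop.

    return $parts;
}

$parts = preg_split_char_offsets('/(\w+)/', "Hello world — goodbye");
// "goodbye" is now reported at 14 rather than 16.
```

This rescans the start of the string for every piece, so it may be slow on very long texts.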

Solution

Over a year later I revisited this and came up with a function that does the job better. The good thing is that it handles multibyte strings without having to ditch the multibyte characters entirely. The bad thing is that it can't use a regular expression the way preg_split() does.

/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture($text) {
    $words = array();

    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";", ":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t",
    );

    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;

    // Split into characters, not bytes. mb_str_split() needs PHP 7.4 or later.
    $characters = mb_str_split($text);

    foreach($characters as $i => $letter) {
        if ( ! in_array($letter, $non_word_chars)) {
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word != "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }

    // Add on the final word, unless the text contained no words at all.
    if ($current_word != "") {
        $words[] = array($current_word, $current_word_position);
    }

    return $words;
}

Doing this:

$text = "Héllo world — goodbye";

$words = split_offset_capture($text);

Ends up with $words containing this:

array(
    array("Héllo", 0),
    array("world", 6),
    array("goodbye", 14),
);

You might need to add further characters to $non_word_chars.
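Because the positions count characters rather than bytes, they pair with mb_substr() rather than substr(). The following check is only an illustration: it hard-codes the result shown above instead of calling the function, and the $checks array is an assumed name used to record the comparisons.

```php
<?php
$text = "Héllo world — goodbye";

// Result copied from the example above: each entry is [word, position].
$words = array(
    array("Héllo", 0),
    array("world", 6),
    array("goodbye", 14),
);

$checks = array();
foreach ($words as list($word, $position)) {
    // mb_substr() counts characters, so it agrees with the positions.
    $by_chars = mb_substr($text, $position, mb_strlen($word, 'UTF-8'), 'UTF-8');

    // substr() counts bytes, so it drifts after the first multibyte character.
    $by_bytes = substr($text, $position, strlen($word));

    $checks[] = array($word, $by_chars === $word, $by_bytes === $word);
}
```

Every word is recovered by mb_substr(), while substr() only gets "Héllo" right because nothing multibyte precedes it.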

For real-world texts, one awkward thing is handling punctuation that immediately follows a word (e.g. Russ' or Russ’) or sits within a word (e.g. Bob's, Bob’s or new-found). To cope with this I came up with an altered function that has three arrays of characters to look for. So it perhaps does more than preg_split() but, again, it doesn't use regular expressions:

/**
 * Splits a piece of text into individual words and the words' position within
 * the text.
 *
 * @param string $text The text to split.
 * @return array Each element is an array, of the word and its 0-based position.
 */
function split_offset_capture_2($text) {
    $words = array();

    // We split into words based on these characters:
    $non_word_chars = array(
        " ", "-", "–", "—", ".", ",", ";", ":", "(", ")", "/",
        "\\", "?", "!", "*", "'", "’", "\n", "\r", "\t"
    );

    // EXCEPT, these characters are allowed to be WITHIN a word:
    // e.g. "up-end", "Bob's", "O'Brien"
    $in_word_chars = array("-", "'", "’");

    // AND, these characters are allowed to END a word:
    // e.g. "Russ'"
    $end_word_chars = array("'", "’");

    // To keep track within the loop:
    $word_started = FALSE;
    $current_word = "";
    $current_word_position = 0;

    // Split into characters, not bytes. mb_str_split() needs PHP 7.4 or later.
    $characters = mb_str_split($text);

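    foreach($characters as $i => $letter) {
        // Neighbouring characters. Past either end of the text we substitute a
        // space, so the edges of the text count as word boundaries and we
        // never read an undefined array offset.
        $previous = $characters[$i-1] ?? " ";
        $next = $characters[$i+1] ?? " ";

        if ( ! in_array($letter, $non_word_chars)
            ||
            (
                // It's a non-word-char that's allowed within a word.
                in_array($letter, $in_word_chars)
                &&
                ! in_array($previous, $non_word_chars)
                &&
                ! in_array($next, $non_word_chars)
            )
            ||
            (
                // It's a non-word-char that's allowed at the end of a word.
                in_array($letter, $end_word_chars)
                &&
                ! in_array($previous, $non_word_chars)
            )
        ) {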
            // A character in a word.
            if ( ! $word_started) {
                // We're starting a brand new word.
                if ($current_word != "") {
                    // Save the previous, now complete, word's info.
                    $words[] = array($current_word, $current_word_position);
                }
                $current_word_position = $i;
                $word_started = TRUE;
                $current_word = "";
            }
            $current_word .= $letter;
        } else {
            $word_started = FALSE;
        }
    }

    // Add on the final word, unless the text contained no words at all.
    if ($current_word != "") {
        $words[] = array($current_word, $current_word_position);
    }

    return $words;
}

So if we have:

$text = "Héllo Bob's and Russ’ new-found folks — goodbye";

then the first function (split_offset_capture()) gives us:

array(
    array("Héllo", 0),
    array("Bob", 6),
    array("s", 10),
    array("and", 12),
    array("Russ", 16),
    array("new", 22),
    array("found", 26),
    array("folks", 32),
    array("goodbye", 40),
);

While the second function (split_offset_capture_2()) gets us:

array(
    array("Héllo", 0),
    array("Bob's", 6),
    array("and", 12),
    array("Russ’", 16),
    array("new-found", 22),
    array("folks", 32),
    array("goodbye", 40),
);
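For comparison, similar rules can probably be expressed with a regular expression after all. This is a different technique from the functions above, shown only as a rough sketch: it uses preg_match_all() with the u modifier and Unicode property classes, then converts each byte offset into a character position. It assumes valid UTF-8 input. The function name match_words_with_positions() is hypothetical, and the pattern treats any letter, digit or underscore as a word character instead of relying on a list of separators, so it may not agree with split_offset_capture_2() on every input.

```php
<?php
// Hypothetical regex-based alternative. Assumes $text is valid UTF-8 and the
// mbstring extension is loaded.
function match_words_with_positions(string $text): array {
    // A run of letters/digits/underscores, optionally continued by a hyphen
    // or apostrophe plus more word characters, optionally ending in an
    // apostrophe (as in "Russ’").
    $pattern = "/[\\p{L}\\p{N}_]+(?:[-'’][\\p{L}\\p{N}_]+)*['’]?/u";

    $words = array();
    if (preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as list($word, $byte_offset)) {
            // Convert the byte offset into a character position.
            $position = mb_strlen(substr($text, 0, $byte_offset), 'UTF-8');
            $words[] = array($word, $position);
        }
    }

    return $words;
}

$words = match_words_with_positions(
    "Héllo Bob's and Russ’ new-found folks — goodbye"
);
```

For this particular sentence the result matches the output of split_offset_capture_2() shown above.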