php utf-8 southeast-asian-languages

Undefined offsets and diacritical marks


I'm trying to parse Laotian text with utf8_ireplace and I'm getting an undefined offset notice.

The one thing I can see is that there are diacritical marks. Would that cause the notice? Or can someone give me a clue as to why it is always Laotian (out of the 6 languages I'm processing)?

Is there a special way that Laotian and similar languages (such as Tibetan) should be handled with regard to utf8_ireplace? Is it a known issue that it raises notices with some characters in those languages? Are diacritics the issue, or is it something else? Does anyone know how to avoid the notices besides turning off notice reporting?

Update: Let me add that in Laotian there are no spaces between words, so you have to separate the strings of characters yourself. That's what I am using utf8_ireplace for, but it fails for Laotian even though it seems to work for Thai, for example. So what I'm really trying to do is break up strings of characters, but for some reason the offsets are undefined. Tibetan also seems to have problems, e.g. "ས".

Update

Here is the central question: why do I get notices when using utf8_ireplace on some words in Laotian?

(Joomla)

// Iterate through the terms and test if they contain the relevant characters.
for ($i = 0, $n = count($terms); $i < $n; $i++)
{
    $charMatches = array();
    if ($lang === 'zh')
    {
        $charCount = preg_match_all('#[\x{4E00}-\x{9FCF}]#mui', $terms[$i], $charMatches);
    }

    elseif ($lang === 'ja')
    {
        // Kanji (Han), Katakana and Hiragana are each checked
        $charCount = preg_match_all('#[\x{4E00}-\x{9FCF}]#mui', $terms[$i], $charMatches);
        $charCount += preg_match_all('#[\x{3040}-\x{309F}]#mui', $terms[$i], $charMatches);
        $charCount += preg_match_all('#[\x{30A0}-\x{30FF}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'th')
    {
        $charCount = preg_match_all('#[\x{0E00}-\x{0E7F}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'km')
    {
        $charCount = preg_match_all('#[\x{1780}-\x{17FF}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'lo')
    {
        $charCount = preg_match_all('#[\x{0E80}-\x{30EFF}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'my')
    {
        $charCount = preg_match_all('#[\x{1000}-\x{109F}]#mui', $terms[$i], $charMatches);
    }
    elseif ($lang === 'bo')
    {
        $charCount = preg_match_all('#[\x{0F00}-\x{0FFF}]#mui', $terms[$i], $charMatches);
    }
    // Split apart any groups of characters.
    for ($j = 0; $j < $charCount; $j++)
    {
        if (isset($charMatches[0][$j]))
        {
            $tSplit = JString::str_ireplace($charMatches[0][$j], '', $terms[$i], null);

            if (!empty($tSplit))
            {
                $terms[$i] = $tSplit;
            }
            else
            {
                unset($terms[$i]);
            }

            $terms[] = $charMatches[0][$j];
        }
    }
}

// Reset array keys.
$terms = array_values($terms);

Solution

  • I think the offset notice could be related to the regex passed to preg_match_all. I tested the regex for 'lo' on regex101.com and it returns this error:

    \x{30EFF} Character offset is too large. Reduce it to 4 hexadecimal characters or enable UTF-16 (u-modifier)

    The other regexes tested just fine. The Unicode Lao block runs from U+0E80 to U+0EFF, so \x{30EFF} is probably a typo for \x{0EFF}.
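
To see what the oversized upper bound does in practice, here is a minimal sketch that compares the range as posted with a range that stops at U+0EFF. The corrected bound is my assumption about what was intended, countMatches is a hypothetical helper that is not part of Joomla, and the "\u{...}" string escapes need PHP 7 or later. I have not confirmed that this change alone removes the notice.

```php
<?php
// Hypothetical helper (not part of Joomla): count the characters a pattern matches.
function countMatches($pattern, $text)
{
    return preg_match_all($pattern, $text);
}

// Range exactly as posted in the question's 'lo' branch.
$asPosted = '#[\x{0E80}-\x{30EFF}]#u';
// Assumption: the intended upper bound is U+0EFF, the end of the Unicode Lao block.
$corrected = '#[\x{0E80}-\x{0EFF}]#u';

$samples = array(
    'lao'     => "\u{0EA5}\u{0EB2}\u{0EA7}", // three code points from the Lao block
    'tibetan' => "\u{0F66}",                 // the Tibetan letter quoted in the question
    'han'     => "\u{4E2D}",                 // a CJK ideograph
);

foreach ($samples as $name => $text) {
    printf(
        "%s: as posted=%d, corrected=%d\n",
        $name,
        countMatches($asPosted, $text),
        countMatches($corrected, $text)
    );
}
```

With the range as posted, the Tibetan and Han samples are matched as well, because U+0E80 to U+30EFF spans many other scripts. With the narrower range only the Lao sample matches, so the splitting loop would only act on Lao characters.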