Writing UTF-8 strings to Word using PHP/COM


I'm trying to generate a Word document from MySQL data using PHP/COM. If the data is plain ASCII text (e.g. "hello"), it displays correctly in the Word document. If the data contains non-ASCII (multi-byte) characters (e.g. "Māori"), the text itself displays correctly but is followed by stray characters at the end (such as NUL characters, spaces, or Chinese symbols).

Environment: I'm using Windows 7 Enterprise, Apache, MySQL, PHP 5.2.17, and Microsoft Office 2010.

Here is a simplified example. It uses neither the database nor a Word document; it simply calls Word's CleanString method to reproduce the problem:

private function _cleanString($wordApp, $str)
{
    $vStr = new VARIANT($str, VT_BSTR, CP_UTF8);
    $bytes = strlen($vStr);
    $chars = mb_strlen($vStr, "UTF-8");
    echo "Test string: $vStr (bytes=$bytes, chars=$chars)<br/>";
    $vStr = $wordApp->CleanString($vStr);
    $bytes = strlen($vStr);
    $chars = mb_strlen($vStr, "UTF-8");
    echo "Test string (after cleaning): $vStr (bytes=$bytes, chars=$chars)<br/>";
    echo "<br/>";
}

public function testUtf8Strings()
{
    com_load_typelib('Word.Application');
    // Specifying codepage as CP_UTF8 to let COM/Word know strings I pass in will be in UTF-8 format.
    $wordApp = new COM("word.application", null, CP_UTF8) or die ("couldn't create an instance of word");
    echo "Loaded Word, version {$wordApp->Version} <br/>";
    $wordApp->visible = false;

    echo "<br/>";
    $this->_cleanString($wordApp, 'No multi-byte characters.');
    $this->_cleanString($wordApp, 'Multi-byte chars: Māori 楠 test.');
    $this->_cleanString($wordApp, 'Multi-byte chars: Ā ā Ē ē Ī.');

    $wordApp->Quit(false); // Important: must pass 'false', otherwise Word does not close
    $wordApp = null;
    echo "Quit Word.";

    return;
}

The HTML output is:

Loaded Word, version 14.0

Test string: No multi-byte characters. (bytes=25, chars=25)
Test string (after cleaning): No multi-byte characters. (bytes=25, chars=25)

Test string: Multi-byte chars: Māori 楠 test. (bytes=34, chars=31)
Test string (after cleaning): Multi-byte chars: Māori 楠 test. 5⹮ (bytes=39, chars=34)

Test string: Multi-byte chars: Ā ā Ē ē Ī. (bytes=33, chars=28)
Test string (after cleaning): Multi-byte chars: Ā ā Ē ē Ī. 琠獥⹴㔠 (bytes=46, chars=33)

Quit Word.

The CleanString method removes non-printing characters from the given string or changes them to spaces. Since my strings are already "clean", I expect to get the same string back. That is not what happens when the string contains multi-byte characters. It looks as if the number of bytes in the original string is being used as the number of characters in the returned string: 34 bytes in gives 34 characters out, and 33 bytes in gives 33 characters out.
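The byte/character mismatch described above can be checked without Word or COM at all. The following is a minimal sketch in plain PHP, assuming the mbstring extension is loaded and the file is saved as UTF-8; the helper name utf8Lengths is hypothetical and not part of the original code. It reports, for each test string, how many more bytes than characters it has.

```php
<?php
// Hypothetical helper (not in the original code): returns the UTF-8 byte
// count, the character count, and the difference between the two.
function utf8Lengths($str)
{
    $bytes = strlen($str);             // counts bytes
    $chars = mb_strlen($str, 'UTF-8'); // counts UTF-8 characters
    return array('bytes' => $bytes, 'chars' => $chars, 'extra' => $bytes - $chars);
}

// The same three test strings used in the question.
$samples = array(
    'No multi-byte characters.',
    'Multi-byte chars: Māori 楠 test.',
    'Multi-byte chars: Ā ā Ē ē Ī.',
);

foreach ($samples as $s) {
    $len = utf8Lengths($s);
    echo "$s bytes={$len['bytes']} chars={$len['chars']} extra={$len['extra']}\n";
}
```

The "extra" values come out as 0, 3 and 5, which matches the number of stray characters appended in the output above (31 characters grew to 34, and 28 grew to 33). That is consistent with the byte count being used as the character count somewhere along the way, although this sketch alone does not show where.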


Solution

  • This turned out to be a PHP bug (https://bugs.php.net/bug.php?id=66431), fixed in PHP 5.4.29. I retested with PHP 5.5.19 and the problem no longer occurs. The HTML output is now:

    Loaded Word, version 14.0
    
    Test string: No multi-byte characters. (bytes=25, chars=25)
    Test string (after cleaning): No multi-byte characters. (bytes=25, chars=25)
    
    Test string: Multi-byte chars: Māori 楠 test. (bytes=34, chars=31)
    Test string (after cleaning): Multi-byte chars: Māori 楠 test. (bytes=34, chars=31)
    
    Test string: Multi-byte chars: Ā ā Ē ē Ī. (bytes=33, chars=28)
    Test string (after cleaning): Multi-byte chars: Ā ā Ē ē Ī. (bytes=33, chars=28)
    
    Quit Word.
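Since the fix depends on the PHP version, code that passes UTF-8 strings through COM could check the running version first. The following is only a sketch: the helper name hasComUtf8Fix is hypothetical, and the 5.4.29 threshold is the only version the answer names as fixed. It does not try to identify 5.5.x releases that might predate the fix, so consult the bug report for those.

```php
<?php
// Hypothetical helper: true when the given PHP version is at or above
// 5.4.29, the release named above as containing the fix.
// Assumption: a plain threshold comparison is good enough; 5.5.x releases
// older than the fix are NOT distinguished by this check.
function hasComUtf8Fix($phpVersion)
{
    return version_compare($phpVersion, '5.4.29', '>=');
}

if (!hasComUtf8Fix(PHP_VERSION)) {
    // Warn rather than abort; ASCII-only strings were unaffected.
    trigger_error(
        'This PHP version may append stray characters to UTF-8 strings returned via COM.',
        E_USER_WARNING
    );
}
```

With the two versions mentioned in this thread, hasComUtf8Fix('5.2.17') returns false and hasComUtf8Fix('5.5.19') returns true.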