php, utf-8, character-encoding, doc

Reading characters "ěščřž..." from a Word .doc file via a PHP function returns a question mark in a diamond


I found a PHP function that reads MS Word .doc files quite well, but when the file contains characters such as "ěščř...", it returns a question mark in a diamond (an unrecognized character?).

The function looks like this:

if ( file_exists($filename) ) {

        # open in binary mode so the bytes are read unchanged on every platform
        if ( ($fh = fopen($filename, 'rb')) !== false ) {

            # the first 0xA00 bytes contain the four length bytes used below
            $headers = fread($fh, 0xA00);

            # byte 0x21C: (ord(n) - 1) ; document has from 0 to 255 characters
            $n1 = ( ord($headers[0x21C]) - 1 );

            # byte 0x21D: ((ord(n) - 8) * 256) ; document has from 256 to 63743 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );

            # byte 0x21E: (ord(n) * 256 * 256) ; document has from 63744 to 16775423 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );

            # byte 0x21F: (ord(n) * 256 * 256 * 256) ; document has from 16775424 to 4294965504 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );

            # total length of the text in the document
            $textLength = ($n1 + $n2 + $n3 + $n4);

            $extracted_plaintext = fread($fh, $textLength);
            fclose($fh);

            # if you want the plain text with no formatting, do this
            //echo $extracted_plaintext;
            echo mb_detect_encoding($extracted_plaintext);

            # if you want to see your paragraphs in a web page, do this
            echo nl2br($extracted_plaintext);

        }

    }
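
For readers following the length arithmetic above: the four ord() lines add up to the 32-bit little-endian value stored at offset 0x21C, minus 0x801 (that is, 1 + 8 * 256). A minimal sketch of the same calculation with unpack() is below. The function name text_length_from_header and the sample header bytes are made up for illustration, and the sketch makes no claim about what the field means in the .doc format.

```php
<?php
// Hypothetical helper: same result as $n1 + $n2 + $n3 + $n4 in the function above.
function text_length_from_header(string $headers): int
{
    if (strlen($headers) < 0x220) {
        throw new InvalidArgumentException('Header is too short to contain offset 0x21C..0x21F');
    }
    // 'V' = unsigned 32-bit integer, little-endian byte order
    $value = unpack('V', substr($headers, 0x21C, 4))[1];

    // (b0 - 1) + (b1 - 8) * 256 + b2 * 65536 + b3 * 16777216 == $value - 0x801
    return $value - 0x801;
}

// Made-up header: 0xA00 zero bytes with 05 0C 00 00 written at offset 0x21C
$headers = substr_replace(str_repeat("\0", 0xA00), "\x05\x0C\x00\x00", 0x21C, 4);

echo text_length_from_header($headers), "\n"; // prints 1028, i.e. (5 - 1) + (12 - 8) * 256
```

The result is a byte count for fread(), so if the text is stored with more than one byte per character it will not equal the number of characters.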

I also tried converting the text to UTF-8, which the rest of my site uses, with something like this:

$extracted_plaintext = iconv("UTF-8","UTF-8//IGNORE",$extracted_plaintext);

But that only removes the invalid characters, so the text is still unreadable. So I am not sure whether the problem really is the charset or something else. I think UTF-8 is correct, because echo mb_detect_encoding($extracted_plaintext); returns UTF-8.
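
One thing worth checking at this point: converting from UTF-8 to UTF-8 cannot repair bytes that were written in a different encoding, and mb_detect_encoding() only makes a guess. Below is a minimal sketch of validating the bytes and then converting from an assumed source encoding. The sample bytes, the choice of Windows-1250 (CP1250), and the helper name bytes_to_utf8 are assumptions for illustration; nothing in the question confirms how this particular file stores its text.

```php
<?php
// Hypothetical helper: convert raw bytes from an assumed source encoding to UTF-8.
// $sourceEncoding is a guess supplied by the caller, not something detected from the file.
function bytes_to_utf8(string $raw, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Conversion from $sourceEncoding failed");
    }
    return $converted;
}

// "ěščřž" as it would be stored in Windows-1250 (one byte per character)
$raw = "\xEC\x9A\xE8\xF8\x9E";

// These bytes do not form a valid UTF-8 sequence, which is what a browser
// renders as question marks in diamonds
var_dump(mb_check_encoding($raw, 'UTF-8'));

// Converting from the actual source encoding keeps the characters instead of dropping them
$utf8 = bytes_to_utf8($raw, 'CP1250');
var_dump(mb_check_encoding($utf8, 'UTF-8'));
echo $utf8, "\n";
```

If the text turned out to be stored as two bytes per character, the same helper could be called with 'UTF-16LE' instead. A validity check alone cannot tell which source encoding is the right one, so the converted output still has to be inspected.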

Edit: here is an attached example of the file.


Solution

  • Reading .doc/.ppt files is quite problematic (even before special characters come into it), so I found it more practical to convert the document to PDF and then read the PDF, which is much easier. Maybe this will help someone in the future avoid wasting as much time as I did trying to read .doc or .ppt directly.
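
The answer does not say which tools were used for the conversion. As one possible sketch, assuming LibreOffice (the soffice command) and poppler's pdftotext are installed and on the PATH, the two steps could be built like this. The function names and file paths are hypothetical, and the commands are only printed here, not executed.

```php
<?php
// Hypothetical helper: command that converts a .doc/.ppt file to PDF with LibreOffice.
// Assumes the `soffice` binary is available; the PDF is written into $outDir.
function build_convert_command(string $docPath, string $outDir): string
{
    return 'soffice --headless --convert-to pdf --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: command that prints the PDF text as UTF-8 on stdout.
// Assumes poppler's `pdftotext` is available; the trailing "-" means "write to stdout".
function build_extract_command(string $pdfPath): string
{
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Example paths, made up for illustration
$convert = build_convert_command('/tmp/input.doc', '/tmp/out');
$extract = build_extract_command('/tmp/out/input.pdf');

echo $convert, "\n";
echo $extract, "\n";

// To actually run the steps, each command could be passed to shell_exec(), e.g.
// $text = shell_exec($extract); after the conversion command has finished.
```

Building the commands with escapeshellarg() keeps file names with spaces or quotes from breaking the shell call.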