
Parse a word document with PHPWord to a string


I've tried several solutions for parsing Word documents to a string in PHP, but they sometimes have trouble with certain documents. So I'm now trying PHPWord to parse a Word document to a string.

I'm looking at this sample file in PHPWord, which reads a Word document and writes it back out as other document files:

include_once 'Sample_Header.php';

// Read contents
$name = basename(__FILE__, '.php');
$source = "resources/{$name}.doc";
echo date('H:i:s'), " Reading contents from `{$source}`", EOL;
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

// (Re)write contents
$writers = array('Word2007' => 'docx', 'ODText' => 'odt', 'RTF' => 'rtf');
foreach ($writers as $writer => $extension) {
    echo date('H:i:s'), " Write to {$writer} format", EOL;
    $xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, $writer);
    $xmlWriter->save("{$name}.{$extension}");
    rename("{$name}.{$extension}", "results/{$name}.{$extension}");
}

include_once 'Sample_Footer.php';

However, I don't want to output another entire Word document; I just want to parse the contents to a string in PHP. How can this be modified to output the content to a string?


Solution

  • Use the object you received from the loader:

    $phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');
    

    It is a nested structure of arrays and objects. In it, locate the [elements] property, and inside each element locate the [text] property. The [text] property contains the text extracted from your Word file.

    Bear in mind that both properties are protected by default, so reading them directly means changing their visibility in the PHPWord library files: AbstractContainer.php for [elements] and Text.php for [text]. Once both properties are public, you can read them from your $phpWord object.

    I can now extract text from .doc files, but I noticed that PHPWord extracts only some 60% of the text from any .doc file, sometimes cutting the last extracted word in half. So if your file has 4,000 words, PHPWord somehow gets only about 2,000 of them.

    I am at a loss as to why PHPWord does not extract all of the text. There are no notices and no exceptions, just an object missing a good half of the text from the .doc file.
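
The traversal described above (walk the elements, then read the text of each one) can also be sketched with PHPWord's public getters, which avoids editing the library files. This is a minimal sketch, not a definitive implementation: extractText() is a hypothetical helper name, and it assumes that containers expose getElements(), text elements expose getText(), and tables expose getRows() and getCells(). It checks for each method before calling it, so element types without text are simply skipped. It does nothing about the truncated .doc output mentioned above, since that happens while the file is being read.

```php
<?php
// Sketch: collect text from a PHPWord element tree through public getters,
// so no library file has to be edited. extractText() is a hypothetical
// helper, not part of PHPWord.
function extractText($node): string
{
    // Containers (Section, TextRun, Cell, ...) hold child elements.
    if (method_exists($node, 'getElements')) {
        $out = '';
        foreach ($node->getElements() as $child) {
            $out .= extractText($child);
        }
        return $out . "\n"; // end each container with a line break
    }

    // Tables hold rows, and rows hold cells (cells are containers).
    if (method_exists($node, 'getRows')) {
        $out = '';
        foreach ($node->getRows() as $row) {
            foreach ($row->getCells() as $cell) {
                $out .= extractText($cell);
            }
        }
        return $out;
    }

    // Leaf elements such as Text carry the actual string.
    if (method_exists($node, 'getText')) {
        $text = $node->getText();
        // Some elements may return a nested object instead of a string.
        return is_object($text) ? extractText($text) : (string) $text;
    }

    return ''; // breaks, images, and other elements without text
}

// Usage sketch (assumes PHPWord is installed through Composer and that
// $source points at a .doc file, as in the question):
//
// require 'vendor/autoload.php';
// $phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');
// $text = '';
// foreach ($phpWord->getSections() as $section) {
//     $text .= extractText($section);
// }
// echo $text;
```

Checking for getElements() before getText() is deliberate: a container that offers both would otherwise have its text counted twice.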