php, html-parsing, text-parsing

PHP - extract text from HTML, translate and put it back


I'm using an API to translate my blog, but it sometimes mangles my HTML, which leaves me with extra work to fix everything.

What I'm now trying to do is extract the text content from the HTML, translate it, and put it back where it was.

I first tried to do this with preg_replace: I would replace every tag with a placeholder like ##a_number## and then restore the original tag once the text had been translated. Unfortunately this is very difficult to manage, because every tag has to be replaced by a unique value.
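
For reference, the placeholder idea can be sketched with preg_replace_callback, which lets a counter generate the unique value for each tag. This is only a minimal sketch: the helper names maskTags and unmaskTags are made up for illustration, strtoupper stands in for the translation call, and the simple pattern `<[^>]+>` will misfire on comments, scripts, or attributes containing `>`. It also assumes the translation API returns the ##n## markers unchanged, which is not guaranteed.

```php
<?php
// Replace each tag with a numbered placeholder and remember the originals.
// Returns [maskedText, listOfTags].
function maskTags(string $html): array
{
    $tags = [];
    $masked = preg_replace_callback('/<[^>]+>/', function (array $m) use (&$tags): string {
        $tags[] = $m[0];
        return '##' . (count($tags) - 1) . '##';
    }, $html);
    return [$masked, $tags];
}

// Put the original tags back in place of the placeholders.
// Unknown placeholder numbers are left as they are.
function unmaskTags(string $text, array $tags): string
{
    return preg_replace_callback('/##(\d+)##/', function (array $m) use ($tags): string {
        return $tags[(int) $m[1]] ?? $m[0];
    }, $text);
}

$html = '<p>go to <a href="/">home page</a></p>';
[$masked, $tags] = maskTags($html);

// strtoupper is a stand-in for the real translation call.
$translated = strtoupper($masked);

echo unmaskTags($translated, $tags), "\n";
```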

I then tried the "Simple HTML DOM" parser, which can be found here: http://simplehtmldom.sourceforge.net/manual.htm

$html = str_get_html($content);
$ret = $html->find('div');
foreach ($ret as $value) {
    echo $value; // prints the div's outer HTML, nested tags included
}

This way I get all the text, but there is still some HTML in each value (a div inside a div), and I don't know how to put the translated text back into the original object. The structure of this object is so complex that displaying it crashes my browser.

I'm running out of options, and there are probably more straightforward ways of doing this. What I'd like is a way to get an object or array containing all the HTML on one side and all the text on the other. I would loop through the text to get it translated and then merge everything back together, to avoid breaking the HTML.

Do you see better options to achieve this?

Thanks, Laurent


Solution

  • For example, I have the following HTML, where all the words are lowercase:

    <div>
        <h2>page not found!</h2>
        <p>go to <a href="/">home page</a> or use the <a href="/search">search</a>.</p>
    </div>
    

    The task is to capitalize every word of the text. To do that, I fetch all text nodes and convert them with the ucwords function (of course, you would call your translation function instead).

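    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    // $html holds the markup shown above
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($dom);
    
    // Visit every text node; the tags themselves are never touched
    foreach ($xpath->query('//text()') as $text) {
        // Skip whitespace-only nodes (strict check, so a node containing "0" is kept)
        if (trim($text->nodeValue) !== '') {
            $text->nodeValue = ucwords($text->nodeValue);
        }
    }
    
    echo $dom->saveHTML();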
    

    The above outputs the following:

    <div>
        <h2>Page Not Found!</h2>
        <p>Go To <a href="/">Home Page</a> Or Use The <a href="/search">Search</a>.</p>
    </div>
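
If the translation API accepts several strings per request, the same approach can collect the text nodes into an array first, which is close to the "HTML on one side, text on the other" split the question asks for. The following is a minimal sketch under that assumption: translateBatch is a hypothetical stand-in (it only capitalizes words), and the XPath filter that skips script and style content is an addition that is not part of the code above.

```php
<?php
// Hypothetical stand-in for a translation API that takes several strings
// at once and returns them in the same order; here it only capitalizes words.
function translateBatch(array $texts): array
{
    return array_map('ucwords', $texts);
}

$html = '<div><h2>page not found!</h2>'
      . '<p>go to <a href="/">home page</a> or use the <a href="/search">search</a>.</p>'
      . '<script>var x = "leave me alone";</script></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Collect translatable text nodes, skipping <script> and <style> content.
// $nodes and $texts share the same indexes.
$nodes = [];
$texts = [];
$query = '//text()[not(ancestor::script) and not(ancestor::style)]';
foreach ($xpath->query($query) as $node) {
    if (trim($node->nodeValue) !== '') {
        $nodes[] = $node;
        $texts[] = $node->nodeValue;
    }
}

// Translate everything in one call, then write each result back to its node.
foreach (translateBatch($texts) as $i => $translated) {
    $nodes[$i]->nodeValue = $translated;
}

$result = $dom->saveHTML();
echo $result;
```

The DOM keeps the markup, and $texts holds only the strings to translate, so the tags never pass through the translation service.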