Tags: php, html, web-scraping, simple-html-dom

How to format plaintext in PHP Simple HTML DOM Parser?


I'm trying to extract the content of a webpage as plain text, without the HTML tags. Here's some sample code:

$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;

The problem is that what I get in $result['body'] is very messy. The HTML is removed, sure, but sentences often run together, since there are no spaces or periods to mark where the text from one HTML tag ends and the text from the next tag begins.

An example:

<body>
    <div class="H2">Header</div>
    <div class="P">this is a paragraph</div>
    <div class="P">this is another paragraph</div>
</body>

Results in:

"Headerthis is a paragraphthis is another paragraph"

Desired result:

"Header. this is a paragraph. this is another paragraph"

Is there any way to format the result from plaintext, or perhaps to manipulate the innertext before using plaintext, so that sentences are clearly delimited?

EDIT:

I'm thinking of doing something like this:

foreach($dom->find('div') as $element) {
    $text = $element->plaintext;
    $result['body'] .= $text.'. ';
}

but there's a problem when the divs are nested: the parent's plaintext already includes the text of all its children, so adding the children afterwards duplicates that text. This could be fixed by checking whether the element's innertext contains a </div> and skipping such parents (checking $text itself won't work, since plaintext has already stripped the tags).

Perhaps I should try callbacks.


Solution

  • Possibly something like this? Tested.

    <?php
    require_once 'vendor/autoload.php';
    
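    $dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
    
    // Collect the plaintext of every <div> and join the pieces with '. '.
    // Note: find('div') also returns nested divs, so a parent's text would
    // repeat its children's text; this works as-is for flat markup only.
    $result['body'] = implode('. ', array_map(function($element) {
        return $element->plaintext;
    }, $dom->find('div')));
    
    echo $result['body'];
    
    Where index.html contains:
    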
    <body>
        <div class="H2">Header</div>
        <div class="P">this is a paragraph</div>
        <div class="P">this is another paragraph</div>
    </body>
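
For pages where the divs are nested, one possible alternative is to walk the text nodes instead of the elements, so each piece of text is visited exactly once. The sketch below is not part of the original answer and does not use Simple HTML DOM: it assumes PHP's built-in DOM extension (DOMDocument and DOMXPath) is available, and the function name plaintext_with_delimiters is made up for illustration.

```php
<?php
// Sketch: join every non-empty text node under <body> with a delimiter.
// Assumption: PHP's bundled DOM extension is used here instead of Simple HTML DOM.
function plaintext_with_delimiters(string $html, string $glue = '. '): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so imperfect markup does not emit notices.
    libxml_use_internal_errors(true);
    // Without a charset declaration, loadHTML() does not assume UTF-8;
    // pure ASCII input such as the sample is unaffected.
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $parts = [];
    // Text nodes are leaves of the tree, so nested divs cannot produce duplicates.
    $query = '//body//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        // Collapse runs of whitespace and drop indentation-only nodes.
        $text = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode($glue, $parts);
}

$html = '<body>'
      . '<div class="H2">Header</div>'
      . '<div class="P">this is a paragraph</div>'
      . '<div class="P">this is another paragraph</div>'
      . '</body>';

echo plaintext_with_delimiters($html), PHP_EOL;
// Prints: Header. this is a paragraph. this is another paragraph
```

One caveat with this approach: inline tags such as <b> or <a> also split the text, so "this is <b>bold</b> text" would come out as three delimited pieces. If that matters, the XPath query would need to be restricted to block-level elements.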
    
