php, html, html-content-extraction

How to extract blocks of text from an HTML page?


I would like to extract blocks of text with more than 100 words from a large HTML page using PHP. Whether the text is contained in <p>...</p> doesn't matter. I only care about the number of words that make up a coherent text block, so text outside of HTML paragraphs should also be taken into consideration.

How can this be done?


Solution

  • I use phpQuery. Are you familiar with jQuery? They share the same selector syntax. You might be concerned about installing a new library, but trust me, this library is well worth the extra overhead.

    phpQuery

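    Once the page is loaded into a phpQuery document ($doc below), you can walk its paragraphs and count the words in each one like this: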

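    // Assumes phpQuery is already loaded and $html holds the page markup.
    $doc = phpQuery::newDocumentHTML($html);
    foreach ($doc->find('p') as $element) {
        $element = pq($element);
        // Print the word count of each paragraph's text, one per line.
        echo str_word_count($element->text()), "\n";
    }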
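The loop above only looks at <p> elements and only prints counts, while the question asks for every block of more than 100 words, including text outside paragraphs. A minimal sketch of one way to do that with PHP's built-in DOM extension instead of phpQuery is shown here. The function name `extractTextBlocks`, the list of tags treated as block boundaries, and the use of `str_word_count` as the definition of a "word" are all assumptions made for illustration, not something the original answer specifies.

```php
<?php
// Hypothetical helper (not part of phpQuery): returns every run of text
// longer than $minWords words, where runs are delimited by block-level tags
// rather than by <p> alone. Inline tags such as <b> or <a> do not split a run.
function extractTextBlocks(string $html, int $minWords = 100): array
{
    // Assumption: these tags mark the edges of a "coherent text block".
    $blockTags = ['html', 'body', 'p', 'div', 'br', 'hr', 'li', 'ul', 'ol',
                  'table', 'tr', 'td', 'th', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
                  'blockquote', 'pre', 'section', 'article', 'header', 'footer'];
    // Content of these tags is never visible text, so it is skipped entirely.
    $skipTags = ['head', 'script', 'style', 'noscript'];

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    // The XML prefix is a common workaround to make loadHTML read UTF-8 input.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $blocks = [];
    $buffer = '';

    // Close the current run: keep it only if it has more than $minWords words.
    $flush = function () use (&$blocks, &$buffer, $minWords) {
        $text = trim(preg_replace('/\s+/', ' ', $buffer));
        $buffer = '';
        if ($text !== '' && str_word_count($text) > $minWords) {
            $blocks[] = $text;
        }
    };

    // Depth-first walk: text nodes extend the run, block tags end it.
    $walk = function (DOMNode $node) use (&$walk, &$buffer, $flush, $blockTags, $skipTags) {
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMText) {
                $buffer .= $child->nodeValue;
                continue;
            }
            if (!($child instanceof DOMElement)) {
                continue; // comments, doctype, processing instructions
            }
            $tag = strtolower($child->nodeName);
            if (in_array($tag, $skipTags, true)) {
                continue;
            }
            $isBlock = in_array($tag, $blockTags, true);
            if ($isBlock) {
                $flush();
            }
            $walk($child);
            if ($isBlock) {
                $flush();
            }
        }
    };

    $walk($dom);
    $flush();

    return $blocks;
}
```

Calling `extractTextBlocks($html)` returns an array of strings, one per qualifying block, with whitespace collapsed. Note that `str_word_count` is locale-dependent and may miscount non-ASCII text, so for multilingual pages a different word-splitting rule would likely be needed.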