php extract web-scraping text-extraction simple-html-dom

PHP: Get plain text from HTML - simplehtmldom or strip_tags?


I want to extract the plain text from HTML. Which should I choose: PHP's strip_tags or simplehtmldom's plaintext extraction?

One advantage of simplehtmldom is its support for invalid HTML. Is that sufficient in itself?


Solution

  • You should probably use simplehtmldom, both for the reason you mentioned and because strip_tags removes only the tags themselves, so it can leave you with non-text content such as JavaScript or CSS from inside script/style blocks.

    You would also be able to filter out text from elements that aren't displayed (for example, those with an inline style="display:none").

    That said, if the HTML is simple enough, strip_tags may be faster and will accomplish the same task.
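
To illustrate the script/style caveat, here is a minimal sketch using only built-in PHP functions. The sample markup and the plain_text() helper are hypothetical, made up for this illustration, and the regex pre-filter is just one possible workaround that assumes reasonably well-formed script/style blocks. It is not what simplehtmldom does internally.

```php
<?php
// Hypothetical sample input, not taken from the original question.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello <b>world</b></p>'
      . '<script>alert("hi");</script></body></html>';

// strip_tags() removes the tags only, so the CSS and JavaScript
// source text is still present in the result.
$naive = strip_tags($html);

// Hypothetical helper: drop script/style blocks first, then strip
// the remaining tags, decode entities, and collapse whitespace.
function plain_text(string $html): string
{
    // Remove <script>...</script> and <style>...</style> with their contents.
    // Fall back to the original input if the regex engine reports an error.
    $withoutCode = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html) ?? $html;

    $text = strip_tags($withoutCode);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

echo $naive, PHP_EOL;             // still contains the CSS rule and the alert() call
echo plain_text($html), PHP_EOL;  // only the visible text remains
```

This approach cannot tell which elements are hidden via display:none, and a regex can be fooled by malformed markup, so a DOM-based parser is likely the safer choice for untrusted or messy pages.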