php, phpquery

How to remove all of the formatting from some HTML


I am scraping the following HTML from a website. Unfortunately the content contains some font tags, and it may contain other inline formatting in the future. I want to keep the text but remove that formatting. I'm using phpQuery for the scraping, but a PHP-only solution would also work.

<div>
<p>
<font
color="#cc0000">
    <font
    color="#000000">Content</font>
        </font>
</p>
<p>Content</p>
<p>
    <font
    color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font
    color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font
    color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font
    color="#cc0000">Content I wish to keep but font should be removed</font>
</p>
<p>
    <font
    color="#000000">Content I wish to keep but font should be removed</font>
</p>
<p>Content</p>
</div>

Solution

  • Use strip_tags():

    $clean = strip_tags($str, '<p><div>');
    

    This removes every tag except <p> and <div> and returns the cleaned string; the text inside the removed tags is kept. To allow more tags, add them to the second argument.

    Example from php.net

     <?php
     $text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
     echo strip_tags($text);
     echo "\n";
    
     // Allow <p> and <a>
     echo strip_tags($text, '<p><a>');
     ?>
    

    The above example will output:

    Test paragraph. Other text
    <p>Test paragraph.</p> <a href="#fragment">Other text</a>
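
Applied to markup like that in the question, the same call could look like the sketch below. The `$html` string is a shortened stand-in for the scraped content and `$clean` is just an illustrative variable name; neither comes from the original post.

```php
<?php
// Shortened sample modelled on the markup in the question (an assumption,
// not the full scraped page).
$html = '<div>'
      . '<p><font color="#cc0000"><font color="#000000">Content</font></font></p>'
      . '<p><font color="#cc0000">Content I wish to keep but font should be removed</font></p>'
      . '</div>';

// Keep only <p> and <div>. Every other tag (here the nested <font> tags)
// is dropped, while the text inside the dropped tags is preserved.
$clean = strip_tags($html, '<p><div>');

echo $clean, "\n";
```

Note that strip_tags() leaves the attributes of allowed tags untouched, so inline formatting such as a style attribute on a <p> would survive this call and would need separate handling.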