Keep only some tags in an HTML string using PHP


I am crawling a website with simple_html_dom and need a result that falls somewhere between ->innertext and ->plaintext.

For example, here is the source string:

<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>

I need to remove the span tags but keep their contents (unless a span contains only &nbsp; entities), while retaining any <i>, <u> and <b> tags.

So the result I'd like to achieve here would be a string:

[28] The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:


Solution

  • You can try the following lines of code:

    <?php
    
    $string = '<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>';
    
    // Remove attributes within the <span> tag, just for clarity's sake.
    $string = preg_replace('/(<span ([^\>]+)>)/i', '<span>', $string);
    
    // Remove any spans that only contain &nbsp;
    $string = preg_replace('/<span>([ ]|&nbsp;)*<\/span>/i', '', $string);
    
    // Replace any consecutive span (opening or closing) tags with a space, to make
    // clear the separation between one span and the next.
    $string = preg_replace('/<(\/)?span><(\/)?span>/i', ' ', $string);
    
    // Remove any remaining instances of opening or closing span tags.
    $string = preg_replace('/<(\/)?span>/i', '', $string);
    
    print $string;
    

    Note that I added an i after the closing slash of each regular expression, which makes the match case-insensitive. That's just in case some of your markup is written as <SPAN>, <span> or even <SpaN>.

    Granted, it's not a tightly compressed single line of regular expression awesomeness, but I did it this way so that you can see each step along the way. You can put a print $string; line after any of the steps to watch the progression. I hope that demonstrating the code this way helps you, in the long run, to get a better feel for how regular expressions and preg_replace can be used.
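The steps above only deal with span tags. If the goal is to drop every tag except <i>, <u> and <b>, one possible variation is to let PHP's built-in strip_tags() do the final removal, since its second argument lists the tags to keep. The following is only a sketch under that assumption: the function name keep_inline_tags and its default allow-list are made up for illustration and are not part of simple_html_dom.

```php
<?php

// Hypothetical helper (the name and the default allow-list are assumptions for this sketch).
function keep_inline_tags(string $html, string $allowed = '<i><u><b>'): string
{
    // Remove spans that contain nothing but whitespace and &nbsp; entities.
    $html = preg_replace('/<span\b[^>]*>(?:\s|&nbsp;)*<\/span>/i', '', $html);

    // Replace runs of two or more adjacent span tags (opening or closing)
    // with a single space, so the text of neighbouring spans stays separated.
    $html = preg_replace('/(?:<\/?span\b[^>]*>){2,}/i', ' ', $html);

    // strip_tags() removes every tag that is not listed in $allowed.
    $html = strip_tags($html, $allowed);

    // Collapse runs of whitespace and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $html));
}

$sample = '<span lang="EN-CA">[28]<span style="font:7.0pt">&nbsp;&nbsp; </span></span>'
        . '<span lang="EN-CA">Section 19 of the <i>Trade-marks Act</i> provides:</span>';

print keep_inline_tags($sample) . "\n";
```

Be aware that strip_tags() leaves the attributes of the allowed tags untouched, so something like <i style="..."> would pass through unchanged. Like the regular expressions above, this is a pragmatic fix for reasonably regular markup rather than a full HTML parser.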