Only leave some tags in an HTML string using PHP

I am crawling a website with simple_html_dom and need the result that would be somewhere between ->innertext and ->plaintext.

For example, here is the source string:

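<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>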

I need to get rid of the span tags but not their contents (unless the span contains only &nbsp;'s), while retaining other inline tags such as <i>.

So the result I'd like to achieve here would be a string:

[28] The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:

Solution

You can try the following lines of code:

<?php

$string = '<span lang="EN-CA">[28]<span style="font:7.0pt &quot;Times New Roman&quot;">'
    . '&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span>'
    . '<span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>';

// Remove attributes within the <span> tag, just for clarity's sake.
$string = preg_replace('/(<span ([^\>]+)>)/i', '<span>', $string);

// Remove any spans that only contain &nbsp;
$string = preg_replace('/<span>([ ]|&nbsp;)*<\/span>/i', '', $string);

// Replace any consecutive span (opening or closing) tags with a space, to make
// clear the separation between one span and the next.
$string = preg_replace('/<(\/)?span><(\/)?span>/i', ' ', $string);

// Remove any remaining instances of opening or closing span tags.
$string = preg_replace('/<(\/)?span>/i', '', $string);

print $string;

Note that I added an i after the closing slash of each regular expression, which makes the match case-insensitive. That's just in case your markup contains <span> or <SPAN> or even <Span>.
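To see what the i modifier changes, here is a small sketch that runs the same pattern with and without it. The sample strings and the $results array are made up purely for illustration; they are not from the page being crawled.

```php
<?php

// Hypothetical inputs that differ only in the case of the tag name.
$samples = ['<span>a</span>', '<SPAN>a</SPAN>', '<Span>a</sPaN>'];

$results = [];
foreach ($samples as $sample) {
    // Without the i modifier, only the all-lowercase tag is matched.
    $caseSensitive = preg_replace('/<(\/)?span>/', '', $sample);

    // With the i modifier, every case variant of the tag is matched.
    $caseInsensitive = preg_replace('/<(\/)?span>/i', '', $sample);

    $results[$sample] = [$caseSensitive, $caseInsensitive];
    print $sample . ' => ' . $caseSensitive . ' | ' . $caseInsensitive . "\n";
}
```

Only the first sample is cleaned by the case-sensitive pattern; the other two pass through it unchanged and are only stripped once the i modifier is present.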

Granted, it's not a tightly compressed single line of regular expression code awesomeness. But, I did it this way so that you can see the steps along the way. You can put the print $string; line throughout to see the progression. I was hoping this way of demonstrating the code to you would help you, in the long run, to get a better feel for how regular expressions and preg_replace can be used.
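Once the individual steps make sense, a more compact variant is possible. The sketch below swaps in a different technique, PHP's built-in strip_tags() with an allow-list, instead of removing span tags one pattern at a time. The shortened input string and the contents of $allowed are assumptions for illustration; adjust the list to whatever tags you actually want to keep.

```php
<?php

// Shortened, hypothetical version of the source string.
$string = '<span lang="EN-CA">[28]<span style="font:7.0pt">&nbsp;&nbsp; </span></span>'
    . '<span lang="EN-CA">Section 19 of the <i>Trade-marks Act</i> provides:</span>';

// Assumed allow-list of tags to keep; everything else is stripped.
$allowed = '<i><b><a>';

// Replace spans that hold only whitespace or &nbsp; with a single space,
// so the text on either side of them stays separated.
$string = preg_replace('/<span[^>]*>(?:\s|&nbsp;)*<\/span>/i', ' ', $string);

// Drop every tag that is not on the allow-list, keeping the text inside it.
$string = strip_tags($string, $allowed);

// Collapse any run of whitespace into one space and trim the ends.
$string = trim(preg_replace('/\s+/', ' ', $string));

print $string;
```

Two caveats with this variant: strip_tags() leaves the attributes of allowed tags untouched, and two adjacent spans with no spacer between them would have their text joined without a space, which the step-by-step version above handles explicitly.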