I am crawling a website with simple_html_dom and need the result that would be somewhere between ->innertext and ->plaintext.
For example, here is the source string:
<span lang="EN-CA">[28]<span style="font:7.0pt "Times New Roman""> &nbsp;
</span></span><span lang="EN-CA">The Canadian trade-marks regime is national in
scope. The owner of a registered trade-mark, subject to a finding of
invalidity, is entitled to the exclusive use of that mark in association with
the wares or services to which it is connected throughout Canada. Section 19 of
the <i>Trade-marks Act</i> provides:</span>
I need to get rid of the span tags but not their contents (unless the span only contains &nbsp;'s), while retaining the <i>, <u> and <b> tags.
So the result I'd like to achieve here would be a string:
[28] The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:
You can try the following lines of code:
<?php
$string = '<span lang="EN-CA">[28]<span style="font:7.0pt "Times New Roman""> &nbsp; </span></span><span lang="EN-CA">The Canadian trade-marks regime is national in scope. The owner of a registered trade-mark, subject to a finding of invalidity, is entitled to the exclusive use of that mark in association with the wares or services to which it is connected throughout Canada. Section 19 of the <i>Trade-marks Act</i> provides:</span>';
// Remove attributes within the <span> tag, just for clarity's sake.
$string = preg_replace('/(<span ([^\>]+)>)/i', '<span>', $string);
// Remove any spans that only contain whitespace and/or &nbsp; entities.
$string = preg_replace('/<span>(\s|&nbsp;)*<\/span>/i', '', $string);
// Replace any consecutive span (opening or closing) tags with a space, to make
// clear the separation between one span and the next.
$string = preg_replace('/<(\/)?span><(\/)?span>/i', ' ', $string);
// Remove any remaining instances of opening or closing span tags.
$string = preg_replace('/<(\/)?span>/i', '', $string);
print $string;
Note that I added an i after the closing slash of each regular expression, which makes the search case-insensitive. That's just in case you have some markup that is written as <SPAN> or <span> or even <SpaN>.
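To see what the i modifier changes, here is a small sketch that runs the same pattern with and without it. The sample strings are made up for illustration and do not come from the crawled page.

```php
<?php
// Hypothetical sample inputs with differently capitalised span tags.
$samples = ['<SPAN>a</SPAN>', '<span>b</span>', '<SpaN>c</sPAn>'];

foreach ($samples as $sample) {
    // Without the "i" modifier only the all-lowercase tags are matched.
    $caseSensitive = preg_replace('/<\/?span>/', '', $sample);
    // With the "i" modifier every capitalisation is matched.
    $caseInsensitive = preg_replace('/<\/?span>/i', '', $sample);
    print $sample . ' => ' . $caseSensitive . ' | ' . $caseInsensitive . "\n";
}
```

Only the lowercase sample is cleaned by the first pattern, while the second pattern strips the tags from all three.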
Granted, it's not a tightly compressed single line of regular expression awesomeness, but I did it this way so that you can see the steps along the way. You can put a print $string; line after each step to see the progression. I was hoping that demonstrating the code this way would help you, in the long run, get a better feel for how regular expressions and preg_replace can be used.
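If you would rather not scatter print statements by hand, the same four steps can be put in an array and looped over, printing the string after each one. This is only a sketch: the input is a shortened, made-up string in the same shape as yours, the $steps array and its labels are my own naming, and it assumes the second pattern is meant to match literal &nbsp; entities as well as whitespace. The array unpacking in foreach needs PHP 7.1 or later.

```php
<?php
// Shortened, hypothetical input shaped like the markup in the question.
$string = '<span lang="EN-CA">[28]<span style="font:7.0pt"> &nbsp; </span></span>'
        . '<span lang="EN-CA">Section 19 of the <i>Trade-marks Act</i> provides:</span>';

// Each entry maps a description to [pattern, replacement]; applied in order.
$steps = [
    'strip span attributes'       => ['/<span\s[^>]+>/i', '<span>'],
    'drop empty spans'            => ['/<span>(?:\s|&nbsp;)*<\/span>/i', ''],
    'adjacent span tags to space' => ['/<\/?span><\/?span>/i', ' '],
    'remove remaining span tags'  => ['/<\/?span>/i', ''],
];

foreach ($steps as $label => [$pattern, $replacement]) {
    $string = preg_replace($pattern, $replacement, $string);
    // Show the intermediate result so each step's effect is visible.
    print $label . ': ' . $string . "\n";
}
```

The order matters here: the empty inner span has to be removed before the adjacent-tag step, otherwise the closing and opening tags around it would not sit next to each other.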