What I'm trying to do is take a block of HTML, strip out all the HTML tags, and put each line of text into a PHP array.
I'm just trying it with one block to test (hence the WHERE ID = '2409' in my MySQL query).
The HTML portion for ID 2409 looks like this:
<table class="description-table">
<tbody>
<tr><td>Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td></tr>
<tr><td>Description</td></tr>
<tr><td></td>
<td><br>
<br><p></p><p></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem </strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong> PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong> ad Quisque Modeste</strong><strong> ac Rem Wisi</strong><strong> ex Hac Congue mus Leo</strong><strong> ab 7/92" Alias</strong><strong> ad 2/73" Adverso & Erat</strong><strong> me Personom Eget</strong><strong> ad Viribus Fuga Fuga</strong><strong> ab Louor-Sit Molles</strong><strong class="c2"> 3x Block-Off Plates</strong><strong class="c2"> ad Facunda</strong><strong class="c2"> ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong></strong><br>
</td>
</table>
And here's my PHP script designed to parse this:
//connect to mysqli
$results = $mysqli->query("SELECT ID, post_content
FROM wp_posts
WHERE ID = '2409';");
while($row = $results->fetch_array()) {
$htmlarray2 = preg_split('/<.+?>/', $row['post_content']);
$htmlarray = array_values(array_filter(array_map('trim', $htmlarray2)));
echo '<pre>';
print_r($htmlarray);
echo '</pre>';
. . .
}
This produces output like this:
Array
(
[0] => Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9
[1] => Donec Rem
[2] => Animam Urgebat
[3] => Rerum Sed 8613 - 3669 8358 & 6699
[4] => 1.mE (magNA) QUO Ad Nominum Statum Massa
[5] => ab SEM Autem Reddet Habitu Sit
[6] => PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM
[7] => Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!
[8] => ad Quisque Modeste
[9] => ac Rem Wisi
[10] => ex Hac Congue mus Leo
[11] => ab 7/92" Alias
[12] => ad 2/73" Adverso & Erat
[13] => me Personom Eget
[14] => ad Viribus Fuga Fuga
[15] => ea Totam Poenam
[16] => ab Louor-Sit Molles
[17] => ad Facunda
[18] => ab Personas Diam
[19] => NUNC
[20] => ex Teniet te Palmam Eaque
[21] => me Teniet in Versus Urna
[22] => **CONDEMNENDUS REM CUM MAGNORUM**
)
This is okay, but now I'm having an issue removing the whitespace before and after the strings in the array.
Let's take node 8 in the array as an example:
. . .
$arrayvalue = $htmlarray2['8'];
which echoes like this
ad Quisque Modeste
Now, what I'm trying to do is obviously trim each element of the array, but for testing I'm just working with this one variable, $arrayvalue.
My issue is that trim() isn't working with this MySQL-fetched variable. Adding trim($arrayvalue); has no effect, and it echoes out the same way as above.
I know this has something to do with me fetching the array via my query, because if I just test this variable on its own in a separate PHP script:
$string = ' ad Quisque Modeste ';
echo trim($string);
it works fine, and echo outputs simply ad Quisque Modeste with no whitespace before or after the string, as desired.
Why isn't trim() working in my while loop?
What's the trick to trimming the leading and trailing whitespace from the elements?
Edit: Here's my full while loop as requested. It's a bit different than the above example (I've been making a lot of modifications trying to solve this myself, so it's constantly changing), but here is what I have right now in full:
while($row = $results->fetch_array()) {
$id = $row['ID'];
echo 'ID: ' . $id;
echo '<br />';
//replace with white space
$converted = strtr($row['post_content'],array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES)));
trim($converted, chr(0xC2).chr(0xA0));
//remove html elements
$htmlarray = preg_split('/<.+?>/', $converted);
// remove empty array elements and re-index array
$htmlarray2 = array_values(array_filter(array_map('trim', $htmlarray)));
// test by getting single value from array
$arrayvalue = $htmlarray2['9'];
// my attempt to trim string in while loop
trim($arrayvalue);
// doesn't trim
echo '<hr>' . $arrayvalue . '<hr>';
// put this here so I can see the full array
echo '<pre>';
print_r($htmlarray2);
echo '</pre>';
}
As requested, here are the results of var_export($row['post_content']);
'<table class="product-description-table">
<tbody>
<tr>
<td class="item" colspan="3">Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td>
</tr>
<tr>
<td class="title" colspan="3"></td>
</tr>
<tr>
<td class="content"><br>
<br>
<p class="c1"></p>
<p class="c1"></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem </strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong> PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong> ad Quisque Modeste</strong><strong> ac Rem Wisi</strong><strong> ex Hac Congue mus Leo</strong><strong> ab 7/92" Alias</strong><strong> ad 2/73" Adverso & Erat</strong><strong> me Personom Eget</strong><strong> ad Viribus Fuga Fuga</strong><strong> ab Louor-Sit Molles</strong><strong class="c2"> 3x Block-Off Plates</strong><strong class="c2"> ad Facunda</strong><strong class="c2"> ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong> </strong><br></td>
<td class="product-content-border"></td>
</tr>
<tr>
<td class="gallery" colspan="3">
<table>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td class="spacer" colspan="3"></td>
</tr>
<tr>
<td class="product-content-border"></td>
</tr>
</tbody>
</table>
<br>
<br>
<br>
<p class="c4"></p>'
Final Edit :):
Posted a solution below. Not going to accept my own answer.
If anyone familiar with regex can help explain the tribulation behind all this and why this regex pattern, /[\s]+/mu, or rather $clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);, fixed this issue, I'll gladly accept that as a proper answer and explanation.
Here's your requested explanation of the regex pattern that solved your issue:
/[\s]+/ (more simply expressed as /\s+/) says "look for one or more white-space characters" (this includes ' ', '\r', '\n', '\t', '\f', and '\v'). The multi-line modifier/flag is not necessary because you are not using anchors (^ or $) in your pattern. The unicode modifier/flag is absolutely critical in your case because your string of HTML text contains many little devils called "NO-BREAK SPACE". Each one is the single code point U+00A0 (written \x{00A0} in a pattern), which UTF-8 encodes as the two bytes 194 and 160 (0xC2 0xA0).
See them highlighted here.
Without the u flag, the NO-BREAK SPACE characters remain and additional filtering will be required to remove them.
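To make the point concrete, here is a minimal sketch, assuming the stray characters really are UTF-8 encoded NO-BREAK SPACEs (bytes 0xC2 0xA0). The variable names and the sample string are made up for illustration and are not taken from your database.

```php
<?php
// Hypothetical sample: the text wrapped in UTF-8 NO-BREAK SPACEs (U+00A0).
$nbsp  = "\xC2\xA0";
$value = $nbsp . 'ad Quisque Modeste' . $nbsp;

// trim() returns a new string (it does not modify its argument), and its
// default character list only covers ASCII whitespace and the NUL byte,
// so the two-byte NO-BREAK SPACEs survive.
$trimmed = trim($value);

// With the u flag the subject is read as UTF-8 and \s also matches U+00A0.
$cleaned = preg_replace('/^\s+|\s+$/u', '', $value);

var_dump($trimmed === $value);               // bool(true): nothing was removed
var_dump(bin2hex(substr($value, 0, 2)));     // string(4) "c2a0"
var_dump($cleaned === 'ad Quisque Modeste'); // bool(true)
```

Note that trim($arrayvalue); on its own line discards its result in any case; the return value has to be assigned or echoed for the trimming to be visible.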
While you eventually got your code to produce the right output, I'm happy to offer a leaner single-step pattern that will get you there faster using only preg_split().
while ($row = $results->fetch_array()) {
$texts = preg_split('/\s*<[^>]+>\s*/u', $row['post_content'], 0, PREG_SPLIT_NO_EMPTY);
var_export($texts);
}
Here is a working regex101 demo.
This new splitting pattern still looks for your tags, but it is more efficient because, between the < and >, I merely ask to match all characters that are "not >" by using [^>]+. This is simpler for the engine than the lazy .+?, which has to pause after each character and test whether the closing > comes next.
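There is also a behavioral difference worth knowing about. The sketch below uses a made-up tag that wraps onto a second line (your sample data does not appear to contain one): without the s flag, . does not match a newline, so the lazy pattern skips such a tag, while [^>]+ still matches it.

```php
<?php
// Hypothetical input: an opening tag whose attribute sits on a second line.
$html = "<strong\nclass=\"c2\">ad Facunda</strong>";

// The original pattern: . cannot cross the newline, so the opening tag
// is never matched and stays in the output.
$lazy = preg_split('/<.+?>/', $html, -1, PREG_SPLIT_NO_EMPTY);

// The negated character class accepts any character except >,
// including the newline, so both tags are removed.
$negated = preg_split('/<[^>]+>/', $html, -1, PREG_SPLIT_NO_EMPTY);

var_export($lazy);
echo "\n";
var_export($negated);
```

With this input, $lazy should hold a single element that still contains the opening tag, and $negated should hold just 'ad Facunda'.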
Furthermore, I included matching for your unicode-extended white-space characters. \s* will match zero or more white-space characters before AND after each tag.
Finally, I should explain the additional parameters on preg_split(). The 0 says "find unlimited matches" -- this is the default behavior, but I must use 0 or -1 as its value to hold its place to ensure that the final parameter is used. PREG_SPLIT_NO_EMPTY spares you having to take the extra step of using array_filter() later. It omits any empty elements generated from the split, so you only get the good stuff.
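If you want to try the pattern without a database connection, here is a small self-contained sketch. The $html string is a hypothetical fragment modelled on the markup above, with one NO-BREAK SPACE inserted by hand; it is not your actual post_content.

```php
<?php
// Hypothetical fragment; "\xC2\xA0" is a UTF-8 NO-BREAK SPACE placed
// where a plain space would normally be.
$html = "<strong>Donec Rem </strong><br>\n"
      . "<strong>\xC2\xA0ad Quisque Modeste</strong><strong> ac Rem Wisi</strong><br>\n"
      . "NUNC<br>";

// A limit of 0 means "no limit"; it is only there so the flag can be passed.
// PREG_SPLIT_NO_EMPTY drops the empty strings left between adjacent tags.
$texts = preg_split('/\s*<[^>]+>\s*/u', $html, 0, PREG_SPLIT_NO_EMPTY);

var_export($texts);
```

This should leave four already-trimmed strings ('Donec Rem', 'ad Quisque Modeste', 'ac Rem Wisi', 'NUNC'), so no array_map('trim', ...) or array_filter() pass is needed afterwards.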