php html unicode preg-replace removing-whitespace

Trim Not Working with Array from MySQL fetched String

What I'm trying to do is take a block of html, strip out all the html tags, and put each line of text into a PHP array.

I'm just trying it with one block to test (hence the WHERE ID = '2409' in my MySQL query).

The HTML portion for ID 2409 looks like this:

<table class="description-table">
<tbody>
<tr><td>Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td></tr>
<tr><td>Description</td></tr>
<tr><td></td>
<td><br>
<br><p></p><p></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem </strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong> PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong>                                                           ad Quisque Modeste</strong><strong>                                                           ac Rem Wisi</strong><strong>                                                           ex Hac Congue mus Leo</strong><strong>                                                           ab 7/92" Alias</strong><strong>                                                           ad 2/73" Adverso & Erat</strong><strong>                                                           me Personom Eget</strong><strong>                                                           ad Viribus Fuga Fuga</strong><strong>                                                           ab Louor-Sit Molles</strong><strong class="c2">                                                           3x Block-Off Plates</strong><strong class="c2">                                                           ad Facunda</strong><strong class="c2">                                                           ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong></strong><br>
</td>
</table>

And here's my PHP script designed to parse this

//connect to mysqli

$results = $mysqli->query("SELECT ID, post_content
FROM wp_posts
WHERE ID = '2409';");

while($row = $results->fetch_array()) {
    $htmlarray2 = preg_split('/<.+?>/', $row['post_content']);
    $htmlarray = array_values(array_filter(array_map('trim', $htmlarray2)));
    echo '<pre>';
        print_r($htmlarray);
    echo '</pre>';
    . . . 
}

This produces an output like this

Array
(
[0] => Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9
[1] => Donec Rem 
[2] => Animam Urgebat
[3] => Rerum Sed 8613 - 3669 8358 & 6699
[4] => 1.mE (magNA) QUO Ad Nominum Statum Massa
[5] => ab SEM Autem Reddet Habitu Sit
[6] =>  PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM
[7] => Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!
[8] =>                                                            ad Quisque Modeste
[9] =>                                                            ac Rem Wisi
[10] =>                                                            ex Hac Congue mus Leo
[11] =>                                                            ab 7/92" Alias
[12] =>                                                            ad 2/73" Adverso & Erat
[13] =>                                                            me Personom Eget
[14] =>                                                            ad Viribus Fuga Fuga
[15] =>                                                            ea Totam Poenam
[16] =>                                                            ab Louor-Sit Molles
[17] =>                                                            ad Facunda
[18] =>                                                            ab Personas Diam
[19] => NUNC
[20] => ex Teniet te Palmam Eaque
[21] => me Teniet in Versus Urna
[22] => **CONDEMNENDUS REM CUM MAGNORUM**
)

This is okay, but now I'm having an issue removing the whitespace before and after the strings in the array.

Let's take an example for the node 8 in the array

. . .
$arrayvalue = $htmlarray2['8'];

which echoes like this

                                                       ad Quisque Modeste

Now, what I'm trying to do is obviously trim each element of the array, but for testing I'm just working with this one variable $arrayvalue.

My issue is that trim() isn't working with this MySQL fetched variable. Meaning adding trim($arrayvalue); has no effect and echoes out the same way as above.

I know this has something to do with me fetching the array via my query, because if I just test this variable out normally in its own PHP script

$string = '                                                            ad Quisque Modeste  ';
echo trim($string);

It works fine, and echo outputs simply ad Quisque Modeste, with no whitespace before or after the string, as desired.

Why isn't trim() working in my while loop? What's the trick to trimming the leading and trailing white-spaces from the elements?

Edit: Here's my full while loop as requested. It's a bit different than the above example (I've been doing a lot of modifications trying to solve this myself so it's constantly changing), but here is what I have right now in full:

while($row = $results->fetch_array()) {
    $id = $row['ID'];
    echo 'ID: ' . $id;
    echo '<br  />';

    //replace &nbsp; with white space
    $converted = strtr($row['post_content'],array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES))); 
    trim($converted, chr(0xC2).chr(0xA0));

    //remove html elements
    $htmlarray = preg_split('/<.+?>/', $converted);

    // remove empty array elements and re-index array
    $htmlarray2 = array_values(array_filter(array_map('trim', $htmlarray)));

    // test by getting single value from array
    $arrayvalue = $htmlarray2['9'];

    // my attempt to trim string in while loop
    trim($arrayvalue);

    // doesn't trim
    echo '<hr>' . $arrayvalue . '<hr>';

    // put this here so I can see the full array
    echo '<pre>';
        print_r($htmlarray2);
    echo '</pre>';
}

As requested, here are the results of var_export($row['post_content']);

'<table class="product-description-table">
<tbody>
<tr>
<td class="item" colspan="3">Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td>
</tr>
<tr>
<td class="title" colspan="3"></td>
</tr>
<tr>
<td class="content"><br>
<br>
<p class="c1"></p>
<p class="c1"></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem&nbsp;</strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong>&nbsp;PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Quisque Modeste</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ac Rem Wisi</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ex Hac Congue mus Leo</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab 7/92" Alias</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad 2/73" Adverso & Erat</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;me Personom Eget</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Viribus Fuga Fuga</strong><strong>&nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab Louor-Sit Molles</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;3x Block-Off Plates</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Facunda</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong>&nbsp;</strong><br></td>
<td class="product-content-border"></td>
</tr>
<tr>
<td class="gallery" colspan="3">
<table>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td class="spacer" colspan="3"></td>
</tr>
<tr>
<td class="product-content-border"></td>
</tr>
</tbody>
</table>
<br>
<br>
<br>
<p class="c4"></p>'

Final Edit :):

Posted a solution below. Not going to accept my own answer.

If anyone familiar with regex can help explain what is going on behind all this, and why this regex pattern, /[\s]+/mu, or rather $clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);, fixed this issue, I'll gladly accept that as a proper answer and explanation.

Solution

Here's your requested explanation on the regex pattern that solved your issue:

/[\s]+/ (more simply expressed as /\s+/) says "look for one or more white-space characters" (this includes: ' ', '\r', '\n', '\t', '\f', '\v'). The multi-line modifier/flag is not necessary because you are not using anchors (^ $) in your pattern. The Unicode modifier/flag is absolutely critical in your case because your string of HTML text contains many little devils called...

"NO-BREAK SPACE". This is a single Unicode character, U+00A0 (written \x{00A0} in a pattern), which UTF-8 encodes as the two bytes 194 and 160 (0xC2 0xA0). See them highlighted here.

Without the u flag, the NO-BREAK SPACE characters remain and additional filtering will be required to remove them.
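To make the point above concrete, here is a minimal sketch rather than the asker's actual code. It assumes the &nbsp; entities have already been decoded to UTF-8, and $value is a made-up stand-in for one array element. It shows the two bytes behind a NO-BREAK SPACE, why a plain trim() leaves them alone, and how the u flag lets \s catch them.

```php
<?php
// Assumption: $value stands in for one array element after &nbsp; was decoded to UTF-8.
$nbsp  = html_entity_decode('&nbsp;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$value = $nbsp . ' ' . $nbsp . ' ad Quisque Modeste' . $nbsp;

// One code point (U+00A0), two bytes in UTF-8: 0xC2 0xA0, i.e. 194 and 160.
echo bin2hex($nbsp), "\n"; // c2a0

// trim() returns a new string (it never modifies its argument), and its default
// character list does not contain NO-BREAK SPACE, so nothing is removed here.
$trimmed = trim($value);
var_dump($trimmed === $value); // bool(true)

// With the u flag, \s also matches U+00A0, so the whole run collapses to one space.
$collapsed = trim(preg_replace('/\s+/u', ' ', $value));
echo $collapsed, "\n"; // ad Quisque Modeste

// Variant that only strips both ends and leaves inner whitespace untouched.
$edgesOnly = preg_replace('/^\s+|\s+$/u', '', $value);
echo $edgesOnly, "\n"; // ad Quisque Modeste
```

Note that calling trim($arrayvalue); on its own line discards the result either way; the return value has to be assigned or echoed.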

While you eventually got your code to produce the right output, I'm happy to offer a leaner single-step pattern that gets you there faster using only preg_split().

while ($row = $results->fetch_array()) {
    $texts = preg_split('/\s*<[^>]+>\s*/u', $row['post_content'], 0, PREG_SPLIT_NO_EMPTY);
    var_export($texts);
}

Here is a working regex101 demo.

This new splitting pattern still looks for your tags, but it is more efficient because between the < and > I merely ask to match all characters that are "not >" by using [^>]+. The negated character class can run greedily up to the closing >, whereas the lazy .+? has to pause after every single character and check whether > comes next.
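Apart from speed, the two patterns can also behave differently on some input. The sketch below uses a made-up tag whose attributes wrap onto a second line (an assumption for illustration, not something taken from the posted HTML): without the s flag, . does not match a newline, so <.+?> skips that tag, while [^>]+ has no such restriction.

```php
<?php
// Hypothetical input: an opening tag with a line break inside it.
$html = "<td\nclass=\"content\">Donec Rem</td>";

// The dot cannot cross the newline, so the opening tag is never matched
// and stays glued to the text.
$withDot = preg_split('/<.+?>/', $html, -1, PREG_SPLIT_NO_EMPTY);

// The negated class accepts any character except ">", newlines included.
$withNegatedClass = preg_split('/<[^>]+>/', $html, -1, PREG_SPLIT_NO_EMPTY);

var_export($withDot);          // one element: the tag plus "Donec Rem"
var_export($withNegatedClass); // one element: "Donec Rem"
```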

Furthermore, I included matching for your unicode-extended white-space characters. \s* will match zero or more white-space characters before AND after each tag.

Finally, I should explain the additional parameters on preg_split(). The 0 says "find unlimited matches" -- this is the default behavior, but I must use 0 or -1 as its value to hold its place to ensure that the final parameter is used. PREG_SPLIT_NO_EMPTY spares you having to take the extra step of using array_filter() later. It omits any empty elements generated from the split, so you only get the good stuff.
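Putting the pieces together, here is a self-contained sketch that can be run without a database. The function name extractTextLines and the $sample string are hypothetical, introduced only for illustration. It also assumes the stored HTML holds literal &nbsp; entities (as the var_export output in the question suggests), so it decodes them before splitting; if your column already stores the raw characters, the decoding step is harmless.

```php
<?php
/**
 * Hypothetical helper (not from the original post): returns the text pieces
 * found between HTML tags, with surrounding whitespace removed.
 */
function extractTextLines(string $html): array
{
    // Assumption: entities such as &nbsp; are stored literally, so decode first.
    // Caveat: this would also turn an escaped &lt;b&gt; into something tag-like.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // -1 (like 0) means "no limit"; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $parts = preg_split('/\s*<[^>]+>\s*/u', $decoded, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. invalid UTF-8 with the u flag.
    return $parts === false ? [] : $parts;
}

$sample = '<strong>&nbsp; &nbsp; &nbsp;ad Quisque Modeste</strong>'
        . '<strong>Donec Rem&nbsp;</strong><br>' . "\n"
        . '<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong>&nbsp;</strong>';

print_r(extractTextLines($sample));
```

For this sample the result has three elements: "ad Quisque Modeste", "Donec Rem" and "**CONDEMNENDUS REM CUM MAGNORUM**". The tag that contains only &nbsp; produces no element at all, because the \s* on both sides of each tag consumes it.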