
Simple PHP DOM Parser adds unwanted spaces in plaintext


I'm using PHP Simple HTML DOM Parser to extract cell values from an HTML table and store them in an array.

HTML:

<td class="inflexion">so<span class="deviation">y</span></td>
<td class="inflexion"><span class="deviation">fui</span></td>
<td class="inflexion"><span class="deviation">er</span>a</td>
<td class="inflexion">haber sería</td>

Desired output:

soy

fui

era

haber sería

PHP:

function getvariations($conjtables){
    $conjtables = str_get_html($conjtables);
    $variations = [];
    foreach ($conjtables->find('td[class=inflexion]') as $inflexion) {
        $variations[] = $inflexion->plaintext;
    }
    return array_unique($variations);
}
$variations = getvariations($conjtables);
foreach ($variations as $variation) {
    echo $variation . '<br>';
}

This mostly works; however, in some cases the output contains an unwanted space right after the text of the span element (see the third item below):

soy

fui

er a

haber sería

Any suggestions for fixing this? I cannot strip spaces indiscriminately because some cells genuinely contain multiple words, as in the last item of the example.


Solution

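  • Use innertext with strip_tags instead of plaintext. strip_tags removes the markup from the cell's inner HTML without inserting any whitespace, so the text on either side of the span stays joined: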

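    // Collect the unique text of every <td class="inflexion"> cell.
    function getvariations($conjtables){
        $conjtables = str_get_html($conjtables);
        $variations = [];
        foreach ($conjtables->find('td[class=inflexion]') as $inflexion) {
            // innertext returns the cell's raw inner HTML; strip_tags drops
            // the <span> tags and keeps the surrounding text untouched.
            $variations[] = strip_tags($inflexion->innertext);
        }
        return array_unique($variations);
    }
    $variations = getvariations($conjtables);
    foreach ($variations as $variation) {
        echo $variation . '<br>';
    }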
    

    Output:

    soy

    fui

    era

    haber sería
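
To see the effect in isolation, here is a minimal sketch that applies strip_tags to the inner HTML of the four sample cells without loading Simple HTML DOM at all. The helper name cellText is hypothetical, and the entity decoding and trimming are assumptions that go beyond the answer above; they only matter if the real table contains entities such as &iacute; or padding whitespace.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): converts a cell's
// inner HTML into plain text.
function cellText(string $innerHtml): string
{
    // strip_tags() deletes the tags and inserts nothing in their place,
    // so "<span>er</span>a" becomes "era".
    $text = strip_tags($innerHtml);

    // Assumption: entities such as &iacute; should be decoded to characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Assumption: leading and trailing whitespace is not wanted.
    return trim($text);
}

// The inner HTML of the four sample cells from the question.
$cells = [
    'so<span class="deviation">y</span>',
    '<span class="deviation">fui</span>',
    '<span class="deviation">er</span>a',
    'haber sería',
];

foreach ($cells as $cell) {
    echo cellText($cell), PHP_EOL;
}
```

In getvariations, the same helper could be called as cellText($inflexion->innertext) in place of the bare strip_tags call. Spaces that are genuinely part of the cell text, as in "haber sería", are left alone because strip_tags only touches markup.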