How to get only the first few tags with PHP Simple HTML DOM Parser


I am trying to get the text of the first 3 td tags in each row using the PHP Simple HTML DOM Parser and collect them in an array.

The table looks like this:

<table>
    <tbody>
        <tr>
            <td>Floyd</td>
            <td>Machine</td>
            <td>Banking</td>
            <td>HelpScout</td>
        </tr>
        <tr>
            <td>Nirvana</td>
            <td>Paper</td>
            <td>Business</td>
            <td>GuitarTuna</td>
        </tr>
        <tr>
            <td>The edge</td>
            <td>Tree</td>
            <td>Hospital</td>
            <td>Sician</td>
        </tr>

        .....
        .....
    </tbody>
</table>

What I am trying to achieve is to collect these in arrays, excluding the 4th td of each tr tag:

array(
   array(
      'art' => 'Floyd',
      'thing' => 'Machine',
      'passion' => 'Banking',
   ),
   array(
      'art' => 'Nirvana',
      'thing' => 'Paper',
      'passion' => 'Business',
   ),
   array(
      'art' => 'The edge',
      'thing' => 'Tree',
      'passion' => 'Hospital',
   ),
);

This is what I have tried:

require_once dirname( __FILE__ ) . '/library/simple_html_dom.php';

$html    = file_get_html( 'https://www.example.com/list.html' );
$collect = array();
$list    = $html->find( 'table tbody tr td' );

foreach( $list as $l ) {
    $collect[] = $l->plaintext;
}

$html->clear();
unset($html);

print_r($collect);

This puts every td into one flat array, which makes it hard to tell which values belong to which keys. Is there a way to do this?


Solution

  • Instead of iterating over all td elements at once, iterate over each tr, fetch the td elements inside it, and keep only the first three, ignoring the 4th:

    $htmlString =<<<html
    <table>
        <tbody>
            <tr>
                <td>Floyd</td>
                <td>Machine</td>
                <td>Banking</td>
                <td>HelpScout</td>
            </tr>
            <tr>
                <td>Nirvana</td>
                <td>Paper</td>
                <td>Business</td>
                <td>GuitarTuna</td>
            </tr>
            <tr>
                <td>The edge</td>
                <td>Tree</td>
                <td>Hospital</td>
                <td>Sician</td>
            </tr>
        </tbody>
    </table>
    html;
    $html = str_get_html($htmlString);
    
    // find all tr tags
    $trs = $html->find('table tr');
    $collect = [];
    
    // for each tr tag, find its td children
    foreach ($trs as $tr) {
        $tds = $tr->find('td');
        // skip rows with fewer than 3 td cells (e.g. header rows)
        if (count($tds) < 3) {
            continue;
        }
        // collect the first 3 cells and ignore the 4th
        $collect[] = [
            'art' => $tds[0]->plaintext,
            'thing' => $tds[1]->plaintext,
            'passion' => $tds[2]->plaintext,
        ];
    }
    print_r($collect); 
    

    The output is:

    Array
    (
        [0] => Array
            (
                [art] => Floyd
                [thing] => Machine
                [passion] => Banking
            )
    
        [1] => Array
            (
                [art] => Nirvana
                [thing] => Paper
                [passion] => Business
            )
    
        [2] => Array
            (
                [art] => The edge
                [thing] => Tree
                [passion] => Hospital
            )
    
    )
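
If the list of keys might change, the per-row mapping can be factored out so the column names live in one place. The following is only a sketch: rows_to_records is a hypothetical helper, not part of Simple HTML DOM, and plain arrays stand in for the td texts so that it runs without the library. It relies only on PHP's built-in array_slice and array_combine.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): maps the first
// count($keys) cell texts of each row onto the given keys.
function rows_to_records(array $rows, array $keys): array
{
    $records = [];
    foreach ($rows as $cells) {
        // Skip rows that have too few cells, e.g. header or spacer rows.
        if (count($cells) < count($keys)) {
            continue;
        }
        // Keep only the leading cells and pair them with the keys.
        $records[] = array_combine($keys, array_slice($cells, 0, count($keys)));
    }
    return $records;
}

// Plain arrays stand in for the text of each row's td elements.
$rows = [
    ['Floyd', 'Machine', 'Banking', 'HelpScout'],
    ['Nirvana', 'Paper', 'Business', 'GuitarTuna'],
    ['Only one cell'],
];

$records = rows_to_records($rows, ['art', 'thing', 'passion']);
print_r($records);
```

With the parser, each entry of $rows would presumably be built inside the tr loop by collecting $td->plaintext for every td returned by $tr->find('td').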