
PHP: Simple HTML Dom parser - Parse HTML table with headers/uneven body rows


I have an HTML table in the format below. Header 1 has one row associated with it (Row 1), Header 2 has two rows (Row 2 and Row 3), and Header 3 has three rows (Row 4, Row 5, and Row 6).

<table>
<thead>
    <tr>
        <th>Header 1</th>
    </tr>
</thead>
<tbody>
        <tr>
            <td>
                Row 1
            </td>
        </tr>
</tbody>
<thead>
    <tr>
        <th>Header 2</th>
    </tr>
</thead>
<tbody>
        <tr>
            <td>
                Row 2
            </td>
        </tr>
        <tr>
            <td>
                Row 3
            </td>
        </tr>

</tbody>
<thead>
    <tr>
        <th>Header 3</th>
    </tr>
</thead>
<tbody>
        <tr>
            <td>
                Row 4
            </td>
        </tr>
        <tr>
            <td>
                Row 5
            </td>
        </tr>
        <tr>
            <td>
                Row 6
            </td>
        </tr>
</tbody>
</table>

I want to use the PHP Simple HTML Dom parser to get the following data:

Header 1, Row 1
Header 2, Row 2, Row 3
Header 3, Row 4, Row 5, Row 6

When I use the parser to get the thead tags, all of them are stored in one array. The tbody tags are stored in another array when I do the foreach loop. How do I preserve the association between the headers and their rows while I am looping?


Solution

  • Without seeing your existing PHP code, it is difficult to say exactly what to change, but something like this should work for your use case:

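    // Assumes $html is a simple_html_dom object already loaded with your HTML block
    $heads = $html->find('thead');
    $result = array();

    foreach ($heads as $head) {
        // Simple HTML DOM property names are lowercase (innertext, plaintext)
        $headerText = trim($head->find('th', 0)->plaintext);
        $result[$headerText] = array();
        // The element directly after this thead is expected to be its tbody
        $rows = $head->next_sibling()->find('td');
        foreach ($rows as $row) {
            $result[$headerText][] = trim($row->plaintext);
        }
    }

    // Output, one line per header: "Header 2, Row 2, Row 3"
    foreach ($result as $header => $rows) {
        echo $header . ', ' . implode(', ', $rows) . PHP_EOL;
    }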
    

    A few caveats: the above is a simple, fairly naive example. For instance, it assumes that each thead contains exactly one th and that the element right after it is the matching tbody.

    Also, if echoing is really all you want to do, it would be more efficient to echo directly in the parsing loop. I separated the output because I assume you want to do more than just print it to the screen.

    Note that it would be fairly simple to do something like this with PHP's native DOM parser; I am assuming you need to use Simple HTML DOM for some other reason.
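
For comparison, here is a minimal sketch of the native approach, using PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM. The function name groupRowsByHeader and the shortened sample markup are made up for illustration. The sketch assumes libxml keeps the repeated thead/tbody pairs as siblings, which is worth checking against your real markup.

```php
<?php
// Shortened sample markup (hypothetical); substitute your own HTML string.
$html = '<table>
<thead><tr><th>Header 1</th></tr></thead>
<tbody><tr><td> Row 1 </td></tr></tbody>
<thead><tr><th>Header 2</th></tr></thead>
<tbody><tr><td>Row 2</td></tr><tr><td>Row 3</td></tr></tbody>
</table>';

// Hypothetical helper: maps each thead's text to the td texts of the tbody after it.
function groupRowsByHeader(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];

    foreach ($xpath->query('//thead') as $head) {
        // textContent joins all th cells; trim() drops surrounding whitespace.
        $header = trim($head->textContent);
        $result[$header] = [];

        // Take the first following sibling element, but only if it is a tbody.
        $body = $xpath->query('following-sibling::*[1][self::tbody]', $head)->item(0);
        if ($body === null) {
            continue; // header without rows
        }
        foreach ($xpath->query('.//td', $body) as $cell) {
            $result[$header][] = trim($cell->textContent);
        }
    }

    return $result;
}

// One line per header, followed by its rows.
foreach (groupRowsByHeader($html) as $header => $rows) {
    echo implode(', ', array_merge([$header], $rows)) . PHP_EOL;
}
```

Unlike the Simple HTML DOM version, this one skips a thead that is not directly followed by a tbody rather than failing on it, and it does not require a third-party library.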