
Get multiple values from HTML with DOM (without ids or classes)


I'm trying to get the proxy and port values from the output HTML page of this JS Bin: http://jsbin.com/noxuqusoga/edit?html

Here is a sample of the table structure from that page. It includes only one tr, but the actual HTML has many tr elements with the same structure:

<table class="table" id="tbl_proxy_list" width="950">
    <tbody>
        <tr data-proxy-id="1355950">
            <td align="left"><abbr title="103.227.175.125">103.227.175.125 </abbr></td>
            <td align="left"><a href="/proxy-server-list/port-8080/" title="Port 8080 proxies">8080</a></td>
            <td align="left"><time class="icon icon-check timeago" datetime="2018-08-18 04:56:47Z">9 min ago</time></td>
            <td align="left">
            <div class="progress-bar" data-value="22" title="1089">
            <div class="progress-bar-inner" style="width:22%; background-color: hsl(26.4,100%,50%);">&nbsp;</div>
            </div>
            <small>1089 ms</small></td>
            <td style="text-align:center !important;"><span style="color:#009900;">95%</span> <span> (94)</span></td>
            <td align="left"><img alt="sg" class="flag flag-sg" src="/assets/images/blank.gif" style="vertical-align: middle;" /> <a href="/proxy-server-list/country-sg/" title="Proxies from Singapore">Singapore <span class="proxy-city"> - Bukit Timah </span> </a></td>
            <td align="left"><span class="proxy_transparent" style="font-weight:bold; font-size:10px;">Transparent</span></td>
            <td><span>-</span></td>
        </tr>
  </tbody>
</table>

I'm able to scrape the proxy address, but I'm having difficulty with the port: its <td> has no id or class, and some port values are wrapped in hyperlinks while others are not.

How can I output every row of the scrape result in the form ip:port?

Here's my code:

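// Requires the Simple HTML DOM library; this path assumes the file sits next to the script
include 'simple_html_dom.php';

$html = file_get_html('http://jsbin.com/noxuqusoga/');

// Print the title attribute (the IP address) of every <abbr> element
foreach($html->find('abbr') as $element)
       echo $element->title . '<br>';

// Print the text of every link inside a <td> (this matches the port links, but also the country links)
foreach($html->find('td a') as $element)
       echo $element->plaintext . '<br>';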

Please help,
Thanks


Solution

  • Instead of writing a selector for the td elements (or for elements inside them, such as abbr or a), write a selector for their parent tr. Then loop over these rows and, for each row, pick out only the elements you need:

    // Select all tr elements inside tbody
    foreach ($html->find('tbody tr') as $row) {
        // The second argument to find() (zero) means we only want the first element matching the selector

        // The IP is in the first <abbr> element inside a td; trim() drops the trailing space in its text
        $ip = trim($row->find('td abbr', 0)->plaintext);
        // The port is in the first <a> element inside a td
        $port = trim($row->find('td a', 0)->plaintext);
        print "$ip:$port\n";
    }
    

    As an alternative, note that besides CSS selectors you can also select elements by their index. In your case, the data you want is in the first and second td of each tr, so you can take those two cells from each row and read their text. Because this reads the text of the whole cell, it also covers rows where the port is not wrapped in a link.
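
The index-based alternative could look roughly like the sketch below. It assumes the Simple HTML DOM library file is named simple_html_dom.php and sits next to the script. The helper name extractProxies and the second sample row (a made-up address with a port that is not a link) are hypothetical and only there for illustration.

```php
<?php
// Assumption: the Simple HTML DOM library file is in the same directory.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper: returns one "ip:port" string per table row,
 * reading the first and second <td> of each row by index.
 */
function extractProxies(string $markup): array
{
    $proxies = [];
    $html = str_get_html($markup);
    if (!$html) {
        return $proxies; // empty input or parse failure
    }

    foreach ($html->find('tbody tr') as $row) {
        $ipCell   = $row->find('td', 0); // first cell: IP address
        $portCell = $row->find('td', 1); // second cell: port, linked or not
        if ($ipCell === null || $portCell === null) {
            continue; // skip rows that do not have both cells
        }
        // plaintext gives the cell's text with tags removed
        $proxies[] = trim($ipCell->plaintext) . ':' . trim($portCell->plaintext);
    }

    return $proxies;
}

// Sample markup: the first row mirrors the question, the second row is made up.
$sample = <<<HTML
<table id="tbl_proxy_list">
  <tbody>
    <tr><td><abbr title="103.227.175.125">103.227.175.125 </abbr></td><td><a href="/proxy-server-list/port-8080/">8080</a></td></tr>
    <tr><td><abbr title="192.0.2.10">192.0.2.10 </abbr></td><td>3128</td></tr>
  </tbody>
</table>
HTML;

foreach (extractProxies($sample) as $proxy) {
    echo $proxy, "\n";
}
```

To run it against the real page, the same loop should work on the object returned by file_get_html() from the question. Reading the whole cell rather than the link inside it is what lets unlinked ports through.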