php  html  web-scraping  html-table  simple-html-dom

Scrape specific <td> in HTML table


I am trying to scrape a table using PHP. The scraping itself works, but I get everything in the webpage's table, and I am unsure how to specify which <td> and/or <tr> elements I want to scrape.

Here's the PHP code:

<?php
include("simple_html_dom.php");
$html=file_get_html("http://www.premierleague.com/en-gb/matchday/league-table.html");
$html=new simple_html_dom($html);

foreach($html->find('table tr') as $row) {
    $cell = $row->find('td', 0);
    echo $row;
}
?>

What I want to get (if you view the website) is: Club name, played, won, lost, goals for, goals against, goal difference, and points.

What I get is everything in the table, including the collapsed team information. It looks like this (I'm not sure a picture is the best way to post it, but I don't know how else to show it; I have highlighted the part that I actually want scraped):

Picture


Solution

  • Have you tried looking at the advanced usage of Simple HTML DOM Parser?

    I wrote this based on the manual at the link above; it might get you in the right direction:

    require "simple_html_dom.php";
    
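    // file_get_html() already returns a parsed simple_html_dom object,
    // so there is no need to wrap it in new simple_html_dom() again.
    $html = file_get_html("http://www.premierleague.com/en-gb/matchday/league-table.html");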
    
    $rows = array();
    foreach($html->find('table.leagueTable tr.club-row') as $tr){
        $row = array();
        foreach($tr->find('td.col-club,td.col-p,td.col-w,td.col-l,td.col-gf,td.col-ga,td.col-gd,td.col-pts') as $td){
            $row[] = $td->innertext;
        }
        $rows[] = $row;
    }
    var_dump($rows);
    

    Essentially, you want all the <tr> elements which have a class of club-row (adding a . indicates class); furthermore, you only want rows which are nested within the <table> with class leagueTable. That's what the first find is doing. The space after the table indicates you want descendants of it.

    Next, you want the <td> elements which have the various classes you mentioned. Separating the selectors with commas matches a cell if it satisfies any one of them, so for each row the combined result is the td.col-club cell, the td.col-p cell, and so on.

    The foreach loops are simply walking through those parsed DOM elements and adding their innertext to an array. You can do whatever you like with them after that.
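
If you would rather have each value labelled by its column, here is a minimal self-contained sketch of the same idea. Note that it swaps in PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM, purely so it runs without an extra include. The sample markup, the clubs, the numbers, and the `hasClass()` helper are all made up for illustration; only the class names are taken from the selectors above, and the real page may be structured differently.

```php
<?php
// Hypothetical sample markup: the class names mirror the selectors above,
// the clubs and numbers are invented for this example.
$sample = <<<HTML
<table class="leagueTable">
  <tr class="club-row">
    <td class="col-pos">1</td>
    <td class="col-club"><a href="#">Club A</a></td>
    <td class="col-p">10</td>
    <td class="col-w">7</td>
    <td class="col-pts">23</td>
  </tr>
  <tr class="club-details"><td colspan="5">collapsed team information</td></tr>
  <tr class="club-row">
    <td class="col-pos">2</td>
    <td class="col-club"><a href="#">Club B</a></td>
    <td class="col-p">10</td>
    <td class="col-w">6</td>
    <td class="col-pts">20</td>
  </tr>
</table>
HTML;

// Hypothetical helper: builds an XPath test that is true when the class
// attribute contains $class as a whole word (like ".class" in a CSS selector).
function hasClass($class) {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// The columns to keep, identified by their class names.
$wanted = array('col-club', 'col-p', 'col-w', 'col-pts');

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

$rows = array();
// Same meaning as 'table.leagueTable tr.club-row': club rows inside that table.
$rowQuery = '//table[' . hasClass('leagueTable') . ']//tr[' . hasClass('club-row') . ']';
foreach ($xpath->query($rowQuery) as $tr) {
    $row = array();
    foreach ($wanted as $class) {
        // Look up each wanted cell relative to the current row.
        $td = $xpath->query('.//td[' . hasClass($class) . ']', $tr)->item(0);
        $row[$class] = $td ? trim($td->textContent) : null;
    }
    $rows[] = $row;
}
print_r($rows);
```

Looking the cells up one class at a time means each row comes back keyed by column name, and `textContent` drops any markup inside the cell (such as a link around the club name). With Simple HTML DOM the rough equivalent would be calling `$tr->find('td.col-club', 0)` per column and reading `plaintext` instead of `innertext`.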