
Extract HTML content using PHP


I have the following code:

$html = file_get_contents("http://www.jabong.com/giordano-Dtlm60058-Black-Analog-Watch-267058.html");

$dom = new DOMDocument();


$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[@id="price_div"]/div[2]/span[2]');  //this catches all elements with 
var_dump($nodes); 

I want to extract the price from the page, but this XPath query does not return it.


Solution

  • Did you ever solve the problem? Here is some working code:

    $html = file_get_contents("http://www.jabong.com/giordano-Dtlm60058-Black-Analog-Watch-267058.html");
    
    //suppress parser errors (the page in question has a lot of invalid markup)
    libxml_use_internal_errors(true);
    
    $dom = new DOMDocument();
    //don't preserve whitespace (set on $dom, before the document is loaded)
    $dom->preserveWhiteSpace = false;
    //as @Larry.Z comments, you forgot to load the $html
    $dom->loadHTML($html);
    
    $xpath = new DOMXPath($dom);
    
    //assuming there can be more than one "price set" on each page
    $prices = array();
    
    $price_divs = $xpath->query('//div[@id="price_div"]');
    foreach ($price_divs as $price_div) {
        $price=array();
        foreach ($price_div->childNodes as $price_item) {
            $content=trim($price_item->textContent);
            if ($content!='') $price[]=$content;
        } 
        $prices[]=$price;
    }
    
    echo '<pre>';
    print_r($prices);
    echo '</pre>';
    

    This outputs:

    Array
    (
        [0] => Array
            (
                [0] => Save 66%
                [1] => Rs. 5850
                [2] => Rs. 1999
            )
    
    )
    

    You can skip the $prices[] part and use only $price if there will never be more than one price set per page.
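
The single-price-set variant mentioned above could look roughly like the sketch below. It is only an illustration: the function name extractPriceSet is hypothetical, and the inline $html string is a made-up stand-in for the real page (whose actual markup may differ), so the example runs without a network request.

```php
<?php
// Hypothetical markup standing in for the real page; the live page may be structured differently.
$html = '<html><body><div id="price_div">'
      . '<span>Save 66%</span> <span>Rs. 5850</span> <span>Rs. 1999</span>'
      . '</div></body></html>';

// extractPriceSet() is an illustrative helper name, not part of any library.
function extractPriceSet(string $html): array
{
    // collect parser warnings internally instead of printing them
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $price = array();

    // take only the first matching div, assuming one price set per page
    $price_div = $xpath->query('//div[@id="price_div"]')->item(0);
    if ($price_div === null) {
        return $price; // no price_div found
    }

    foreach ($price_div->childNodes as $price_item) {
        $content = trim($price_item->textContent);
        if ($content != '') $price[] = $content; // skip whitespace-only nodes
    }
    return $price;
}

$price = extractPriceSet($html);
print_r($price);
```

With the stand-in markup, $price holds the three strings directly, without the outer $prices array. Checking item(0) against null avoids an error when the page has no price_div at all.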