
xpath: extract data from a node using xpath


I want to extract only the sales rank (which in this case is 5) from this text:

Amazon Best Sellers Rank: #5 in Books ( See Top 100 in Books )

It comes from this web page: http://www.amazon.com/Mockingjay-Hunger-Games-Book-3/dp/0439023513/ref=tmm_hrd_title_0

So far I have gotten down to this, which selects "Amazon Best Sellers Rank:":

//li[@id='SalesRank']/b/text()

I am using PHP DOMDocument and DOMXPath.


Solution

  • You can use pure XPath (the result keeps the leading "#", e.g. "#5"):

    substring-before(normalize-space(/html/body//ul/li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")
    

    However, if the input HTML is messy, you may get more reliable results by using XPath to grab the full text of the li node and then applying a regex to that text to pull out the specific value you want.

    Demonstration of both methods using PHP with DOMDocument and DOMXPath:

    // Method 1: XPath only
    $xp_salesrank = 'substring-before(normalize-space(/html/body//li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")';
    
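    // Method 2: XPath and Regex
    // XPath expression that returns the full text of the list item
    $regex_ranktext = 'string(/html/body//li[@id="SalesRank"])';
    // [\d,]+ also accepts ranks written with thousands separators, e.g. "#1,234"
    $regex_salesrank = '/Best\s+Sellers\s+Rank:\s*(#[\d,]+)\s+/ui';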
    
    // Test URLs
    $urls = array(
        'http://rads.stackoverflow.com/amzn/click/0439023513',
        'http://www.amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/ref=tmm_kin_title_0?ie=UTF8&m=AG56TWVU5XWC2',
    );
    
    // Results
    $ranks = array();
    $ranks_regex = array();
    
    foreach ($urls as $url) {
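        $d = new DOMDocument();
        // Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
        libxml_use_internal_errors(true);
        $d->loadHTMLFile($url);
        libxml_clear_errors();
        $xp = new DOMXPath($d);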
    
        // Method 1: use pure xpath
        $ranks[] = $xp->evaluate($xp_salesrank);
    
        // Method 2: use xpath to get a section of text, then regex for more specific item
        // This method is probably more forgiving of bad HTML.
        $rank_regex = '';
        $ranktext = $xp->evaluate($regex_ranktext);
        if ($ranktext) {
            if (preg_match($regex_salesrank, $ranktext, $matches)) {
                $rank_regex = $matches[1];
            }
        }
        $ranks_regex[] = $rank_regex;
    
    }
    
    assert($ranks === $ranks_regex); // Both methods should give the same result.
    var_dump($ranks);
    var_dump($ranks_regex);
    

    The output I get is:

    array(2) {
      [0]=>
      string(2) "#4"
      [1]=>
      string(2) "#3"
    }
    array(2) {
      [0]=>
      string(2) "#4"
      [1]=>
      string(2) "#3"
    }
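
Since the question asks for the bare number (5 rather than "#5"), here is a minimal offline sketch of both approaches that strips the "#". It runs against a hardcoded fragment so no network access is needed. The `$html` string is a hypothetical stand-in that only mimics the structure described in the question, and the variable names are illustrative; the real Amazon markup may differ.

```php
<?php
// Hypothetical fragment modeled on the text quoted in the question;
// an assumption for illustration, not the actual page source.
$html = '<html><body><ul>'
      . '<li id="SalesRank"><b>Amazon Best Sellers Rank:</b> #5 in Books '
      . '(<a href="#">See Top 100 in Books</a>)</li>'
      . '</ul></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The text node right after the <b> label, e.g. " #5 in Books (".
$node = '//li[@id="SalesRank"]/b[1]/following-sibling::text()[1]';

// Method 1: XPath only. Take the first space-delimited token ("#5"),
// then keep whatever follows the "#".
$rankXpath = $xpath->evaluate(
    'substring-after(substring-before(normalize-space(' . $node . '), " "), "#")'
);

// Method 2: string value of the whole <li>, then a regex whose capture
// group sits after the "#" so only the digits (and any commas) are kept.
$text = $xpath->evaluate('string(//li[@id="SalesRank"])');
$rankRegex = preg_match('/Best\s+Sellers\s+Rank:\s*#([\d,]+)/ui', $text, $m) ? $m[1] : '';

var_dump($rankXpath, $rankRegex);
```

For this fragment both variables should hold the string "5". If the rank can contain a thousands separator, the value can be cleaned afterwards with something like `str_replace(',', '', $rankRegex)` before casting to an integer.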