php screen-scraping

extract value from web page


Hi, I have a website's home page that I am reading in using cURL, and I need to grab the number of pages that the site has.

The information is in a div:

<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
<a href="/users?page=5" title="go to page 5"><span class="page-numbers">5</span></a>
<span class="page-numbers dots">&hellip;</span>

<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers next"> next</span></a>
</div>

The value I need is 15. It could be any number depending on the site, but it will always be in the same position.

How can I read this value easily and assign it to a variable in PHP?

Thanks

Jonathan


Solution

  • You can use PHP's DOM extension for that. Read the page with DOMDocument::loadHTMLFile(), then create a DOMXPath object and query all span elements in the document whose class attribute is exactly "page-numbers".

    (edit: oops, that's not what you're looking for, see second code snippet)

    $html = '<html><head><title>:::</title></head><body>
    <div class="pager">
    <span class="page-numbers current">1</span>
    <a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
    <a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
    <a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
    <a href="/users?page=5" title="go to page 5"><span class="page-numbers">5</span></a>
    <span class="page-numbers dots">&hellip;</span>
    
    <a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
    <a href="/users?page=2" title="go to page 2"><span class="page-numbers next"> next</span></a>
    </div>
    </body></html>';
    
    $doc = new DOMDocument;
    // since the content "is already here" we use loadHTML(content)
    // instead of loadHTMLFile(url)
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodelist = $xpath->query('//span[@class="page-numbers"]');
    echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';
    

    edit: does this

    <a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
    

    (the second-to-last a element) always point to the last page, i.e. does this link contain the value you're looking for?
    If so, you can use an XPath expression that selects the second-to-last a element and, from there, its child span element.

    //div[@class="pager"] <- select each <div> where the attribute class equals "pager"
    //div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
    //div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second-to-last
    //div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second-to-last <a> element in the pager <div>
    

    ( you might want to fetch a good XPath tutorial ;-) )
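
    To see how each step of the expression narrows the selection, here is a small sketch that runs all four expressions against a shortened pager. The $sample markup, the $steps list, and the $counts array are made up for this illustration and are not part of the original page.

    ```php
    <?php
    // Shortened, hypothetical pager markup (three links instead of six).
    $sample = '<div class="pager">
    <span class="page-numbers current">1</span>
    <a href="/users?page=2"><span class="page-numbers">2</span></a>
    <a href="/users?page=15"><span class="page-numbers">15</span></a>
    <a href="/users?page=2"><span class="page-numbers next"> next</span></a>
    </div>';

    $doc = new DOMDocument();
    $doc->loadHTML($sample);
    $xpath = new DOMXPath($doc);

    // The four expressions from the explanation, from broadest to narrowest.
    $steps = [
        '//div[@class="pager"]',
        '//div[@class="pager"]/a',
        '//div[@class="pager"]/a[position()=last()-1]',
        '//div[@class="pager"]/a[position()=last()-1]/span',
    ];

    // Record how many nodes each expression matches.
    $counts = [];
    foreach ($steps as $expr) {
        $counts[$expr] = $xpath->query($expr)->length;
        echo $expr, ' matches ', $counts[$expr], " node(s)\n";
    }
    ```

    With this sample the expressions match 1, 3, 1 and 1 node(s) respectively; on the real page the second expression would match all six links in the pager.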

    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
    if ( 0 < $nodelist->length ) {
      echo $nodelist->item(0)->nodeValue;
    }
    else {
      echo 'not found';
    }
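
Since the question asks for the value in a variable, the snippet above could be wrapped in a small function. This is only a sketch: the function name lastPageNumber and the $sampleHtml markup are hypothetical, the nullable return type assumes PHP 7.1 or later, and it assumes the second-to-last link in the pager always points to the last page. Pages fetched with cURL are often not valid HTML, so the parser warnings are collected via libxml_use_internal_errors() instead of being printed.

```php
<?php
// Hypothetical helper: returns the last page number shown in the pager,
// or null if the pager (or a numeric value in it) cannot be found.
function lastPageNumber(string $html): ?int
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Only accept plain digits, so a label such as "next" is never returned.
    $text = trim($nodes->item(0)->nodeValue);
    return ctype_digit($text) ? (int) $text : null;
}

// Shortened sample markup, made up for this example.
$sampleHtml = '<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2"><span class="page-numbers">2</span></a>
<span class="page-numbers dots">&hellip;</span>
<a href="/users?page=15"><span class="page-numbers">15</span></a>
<a href="/users?page=2"><span class="page-numbers next"> next</span></a>
</div>';

$pages = lastPageNumber($sampleHtml);
echo $pages === null ? 'not found' : $pages, "\n";
```

Here $pages ends up as the integer 15. Returning null rather than echoing 'not found' inside the function leaves it to the caller to decide what a missing pager means.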