python, web, web-crawler, scrapy, scraper

Scrapy recursive website crawl after login


I have coded a spider to crawl a website after login:

import scrapy
from scrapy.http import FormRequest
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["mydomain.com"]
    start_urls = ['https://login.mydomain.com/login']

    rules = [Rule(LinkExtractor(allow=('//a[contains(text(), "Next")]'), restrict_xpaths=('//a[contains(text(), "Previous")]',)), 'parse_info')]

    def parse(self, response):
        return [FormRequest.from_response(response,
            formdata={"username":"myemail","password":"mypassword"},
            callback=self.parse_info, dont_filter=True)]

    def parse_info(self, response):
        items = []
        for tr in range(1, 5):
            xpath = "/html/body/table/tbody/tr[%s]/td[1]/text()" % tr
            td1 = Selector(response=response).xpath(xpath).extract()
            item = MyItem()
            item['col1'] = td1
            items.append(item)

        return items

And the HTML:

<html>
   <table>
       <tbody>
          <tr><td>Row 1</td></tr>
          <tr><td>Row 2</td></tr>
       </tbody>
   </table>
   <div><a href="?page=2">Next</a></div>
   <div><a href="#">Previous</a></div>
</html>

So what the spider does is log the user in automatically from the login page and redirect to the home page, which contains the HTML above.

Now what I want to achieve is to scrape the next page after the first one using the Python script above.

I have read the Scrapy documentation on Rule implementation but have had no success making it work. Please help me out; I have been stuck on this for over a day now. Thank you.


Solution

  • I have read the Scrapy documentation on Rule implementation but have had no success making it work.

    The Rule in your code does not work because you are using the standard spider (scrapy.Spider) rather than the CrawlSpider, which is the only spider class that actually processes a rules attribute.
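
    If you did want to make the Rule work, the spider would have to inherit from CrawlSpider instead, and that takes some care: CrawlSpider reserves parse() for its own rule machinery, so the login cannot go through parse(). A rough, untested sketch of that route (the class name, the parse_start_url login hook, and the call into CrawlSpider's private _requests_to_follow helper are my own assumptions, not something from your code):

    import scrapy
    from scrapy.http import FormRequest
    from scrapy.contrib.spiders import CrawlSpider, Rule   # scrapy.spiders in newer Scrapy
    from scrapy.contrib.linkextractors import LinkExtractor

    class LoginCrawlSpider(CrawlSpider):
        name = "login_crawl"
        allowed_domains = ["mydomain.com"]
        start_urls = ['https://login.mydomain.com/login']

        # Follow every "Next" link and scrape each page it leads to.
        rules = [Rule(LinkExtractor(restrict_xpaths='//a[contains(text(), "Next")]'),
                      callback='parse_info', follow=True)]

        # CrawlSpider reserves parse() for itself; responses for the
        # start URLs arrive here instead, so submit the login form here.
        def parse_start_url(self, response):
            return FormRequest.from_response(response,
                formdata={"username": "myemail", "password": "mypassword"},
                callback=self.after_login, dont_filter=True)

        def after_login(self, response):
            # Scrape the post-login page ourselves ...
            for item in self.parse_info(response):
                yield item
            # ... and hand it to the rule machinery (a private CrawlSpider
            # helper) so that its "Next" link gets followed.
            for request in self._requests_to_follow(response):
                yield request

        def parse_info(self, response):
            for tr in range(1, 5):
                xpath = "/html/body/table/tbody/tr[%s]/td[1]/text()" % tr
                item = MyItem()   # MyItem as defined in your items module
                item['col1'] = response.xpath(xpath).extract()
                yield item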

    A simpler route, though, is to keep the standard Spider and implement the pagination manually instead of using the CrawlSpider. Do something like:

    def parse_info(self, response):
        # items = []
        for tr in range(1, 5):
            xpath = "/html/body/table/tbody/tr[%s]/td[1]/text()" % tr
            td1 = Selector(response=response).xpath(xpath).extract()
            item = MyItem()
            item['col1'] = td1
            # items.append(item)
            yield item
        # return items
    
        # If there is a next page, extract href, build request
        # and send it to server
        next_page = response.xpath('//a[contains(text(), "Next")]/@href')
        if next_page:
            next_page_href = next_page.extract()[0]
            next_page_url = response.urljoin(next_page_href)
            request = scrapy.Request(next_page_url, callback=self.parse_info)
            yield request
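
    Note that both versions assume a MyItem class that the question never shows. A minimal definition, with just the one field used here, would be something like:

    import scrapy

    class MyItem(scrapy.Item):
        # One Field per scraped column; add more as needed.
        col1 = scrapy.Field()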