Tags: python, xpath, web-crawler, scrapy

Scrapy script: how to find a specific keyword and return or print the URL


OK, so I have to edit this completely. I have the script partially working: I can start it without any problem, and here is the script. The paste link is here: http://pastebin.com/SKa5Wh1h so you can see what I get on the cmd line. I'm sure the keyword being searched for is in those links, because I tried other words too, but it's not downloading them.

import scrapy
from scrapy.http import Request

from FinalSpider.items import Page  # Defined in items.py

URL = "http://url.com=%d"
starting_number = 60000
number_of_pages = 100

class FinalSpider(scrapy.Spider):
    name = "FinalSpider"
    allowed_domains = ['url.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs counting down from starting_number
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        for link in response.xpath('//a[text()="Amount"]/@href').extract():
            yield Page(url=link)

Solution

  • Here you are asking two things:

    1. How to extract some element?

    The xpath that you are providing, response.xpath('//100.00()'), is an invalid xpath expression.

    If you want to find an a tag with some substring in the text, like <a href="something"> 100.00 </a>, the correct xpath would be '//a[contains(text(), "100.00")]'. Note the use of contains; if you have the exact text you could use '//a[text()="100.00"]' (XPath uses a single = for comparison, not ==).
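A quick way to sanity-check these expressions outside a running spider is with lxml, which Scrapy's selectors are built on. The HTML fragment below is made up for illustration; note that with the exact-text form, surrounding whitespace counts:

```python
from lxml import html

# A made-up fragment resembling the target page
doc = html.fromstring(
    '<body>'
    '<a href="/page/1"> 100.00 </a>'   # substring match only (extra whitespace)
    '<a href="/page/2">100.00</a>'     # exact text match
    '<a href="/page/3">200.00</a>'
    '</body>'
)

# contains() matches any <a> whose text includes "100.00"
print(doc.xpath('//a[contains(text(), "100.00")]/@href'))  # ['/page/1', '/page/2']

# exact comparison uses a single "=" in XPath, and whitespace counts
print(doc.xpath('//a[text()="100.00"]/@href'))             # ['/page/2']
```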

    2. What do you do with the found element?

    In Scrapy, it's customary to create an Item class to hold the data you have scraped, logically structured by the Fields you define.

    So first, you create an Item subclass with a url Field, and in your spider you return or yield a new instance of that Item with the url field set to the value you found in the page.

    Putting all this together:

    You have to create an Item, as shown here:

    import scrapy
    
    class Page(scrapy.Item):
        url = scrapy.Field()
    

    Then, in your spider, extract all the meaningful data from the response object. Look at the examples here to get a feel for it. But in general, your code will look like:

    import scrapy
    from myproject.items import Page  # Defined in items.py
    
    class MySpider(scrapy.Spider):
        [...]
    
        def parse(self, response):
            for link in response.xpath('//a[text()="100.00"]/@href').extract():
                yield Page(url=link)
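One more thing worth flagging about the spider in the question: Scrapy calls start_requests (plural), so a method named start_request is never invoked, and range(starting_number, number_of_pages, -1) counts down from 60000 to 101, not "1000 down to 501" as the comment claims. The URL generation itself can be checked with plain Python (http://url.com=%d is the placeholder pattern from the question):

```python
URL = "http://url.com=%d"
starting_number = 60000
number_of_pages = 100

def page_urls():
    # range(start, stop, -1) counts down and stops *before* stop,
    # so this yields IDs from starting_number down to number_of_pages + 1
    for i in range(starting_number, number_of_pages, -1):
        yield URL % i

urls = list(page_urls())
print(urls[0])    # http://url.com=60000
print(urls[-1])   # http://url.com=101
print(len(urls))  # 59900
```

If you really want only 100 pages, the stop value should be starting_number - number_of_pages instead.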