OK, so I have to edit this completely. I have the script partially working: I can start it without any problem, and the script is below. The paste is at http://pastebin.com/SKa5Wh1h, where you can see what I get on the command line. I'm sure the keyword being searched for is in those links, because I tried other words too, but it's not downloading them.

    import scrapy
    import requests
    from scrapy.http import Request
    from FinalSpider.items import Page  # Defined in items.py

    URL = "http://url.com=%d"
    starting_number = 60000
    number_of_pages = 100


    class FinalSpider(scrapy.Spider):
        name = "FinalSpider"
        allowed_domains = ['url.com']
        start_urls = [URL % starting_number]

        def __init__(self):
            self.page_number = starting_number

        def start_request(self):
            # generate page IDs counting down from starting_number
            for i in range(self.page_number, number_of_pages, -1):
                yield Request(url=URL % i, callback=self.parse)

        def parse(self, response):
            for link in response.xpath('//a[text()="Amount"]/@href').extract():
                yield Page(url=link)

Here you are asking two things.

The XPath that you are providing, `response.xpath('//100.00()')`, is an invalid XPath expression. If you want to find an `a` tag with some substring in its text, like `<a href="something"> 100.00 </a>`, the correct XPath would be `'//a[contains(text(), "100.00")]'`. Note the use of `contains`; if you have the exact text, you could use `'//a[text() = "100.00"]'` (XPath compares with a single `=`).

In Scrapy, it's customary to create an `Item` class that holds the data you have scraped, logically structured by the `Field`s you have defined. So first you create an `Item` subclass with a `url` `Field`, and in your spider you `return` or `yield` a new instance of that `Item` with the field `url` set to the value you found in the page.

Putting all this together, you have to create an `Item`, as shown here:

    import scrapy


    class Page(scrapy.Item):
        url = scrapy.Field()

Then, in your spider, extract all the meaningful data from the `response` object. Look at the examples here to get a feeling. In general, your code will look like this:

    import scrapy
    from myproject.items import Page  # Defined in items.py


    class MySpider(scrapy.Spider):
        [...]

        def parse(self, response):
            for link in response.xpath('//a[text()="100.00"]/@href').extract():
                yield Page(url=link)
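
Two details in the spider from the question may also be worth checking. Scrapy looks for a method named `start_requests` (plural), so a method called `start_request` is never invoked and the crawl falls back to `start_urls`. Also, `range(starting_number, number_of_pages, -1)` counts all the way down to 101 rather than producing 100 pages. A minimal sketch of the URL generation in plain Python, assuming the intent is to fetch `number_of_pages` IDs counting down from `starting_number` (the helper name `page_urls` is hypothetical):

```python
URL = "http://url.com=%d"  # placeholder pattern from the question
starting_number = 60000
number_of_pages = 100


def page_urls(start, count):
    """Yield `count` URLs, counting page IDs down from `start`."""
    for page_id in range(start, start - count, -1):
        yield URL % page_id


urls = list(page_urls(starting_number, number_of_pages))
print(len(urls), urls[0], urls[-1])
```

Inside the spider, `start_requests` could then loop over `page_urls(starting_number, number_of_pages)` and yield one `Request(url=u, callback=self.parse)` per URL.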