I'm using Scrapy to crawl some pages. I take the start_urls from an Excel sheet, and I want those exact start URLs to appear in the results rather than the redirected URLs, because I need the originals in order to do Excel lookups.
The problem is that I only seem to be able to get output that gives the destination URL.
My code is as follows:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom5.items import ICcom5Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field
import requests
import csv
import sys
class MySpider(Spider):
    name = "ICcom5"
    start_urls = [l.strip() for l in open('items5.csv').readlines()]

    def parse(self, response):
        item = Item()
        titles = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]')
        items = []
        for titles in titles:
            item = ICcom5Item()
            home_url = ("http://www.indeed.co.uk")
            item['_pageURL'] = response.request.url
            item['description'] = ' '.join(titles.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item['role_title_link'] = titles.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            items.append(item)
        return items
Pretty simple code, but I'm struggling to understand what I can do from the Scrapy docs.
I have modified the code according to the advice, but I'm still not getting the original URLs from my source spreadsheet. Example URLs are as follows:
https://www.indeed.co.uk/rc/clk?jk=a47eb72131f3d588&fccid=c7414b794cb89c1c&vjs=3
https://www.indeed.co.uk/rc/clk?jk=8c7f045caddb116d&fccid=473601b0f30a6c9c&vjs=3
https://www.indeed.co.uk/company/Agilysts-Limited/jobs/Back-End-Java-Developer-3ec6efc3ebc256c5?fccid=d1f7896a8bd9f15e&vjs=3
You can use response.request.url in the parse function to get the original URL you requested.
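For example, a minimal spider along these lines (the spider name and start URL here are just placeholders) records the URL of the request that produced each response:

import scrapy

class RequestUrlSpider(scrapy.Spider):
    name = 'requesturl_demo'
    start_urls = ['https://www.indeed.co.uk/']

    def parse(self, response):
        # the URL of the request object attached to this response
        yield {'request_url': response.request.url}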
UPDATE: Either I understand the documentation wrong or it's a bug. Specifically, the passage

HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection)

makes me really think that the original request URL should be available under response.request.url.
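As a quick way to see what you actually get here (a throwaway check, not part of the fix), you can log both values inside parse and compare them:

    def parse(self, response):
        # response.url is the final URL after any redirect; response.request.url
        # is the URL of the request object attached to this response
        self.logger.info('request.url=%s  response.url=%s',
                         response.request.url, response.url)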
Anyway, as stated in the RedirectMiddleware documentation, there's an alternative way: you can use the redirect_urls key of request.meta to get the list of URLs the request went through. So here's the modified (simplified) version of your code as a PoC:
# -*- coding: utf-8 -*-
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = [
        'https://www.indeed.co.uk/rc/clk?jk=a47eb72131f3d588&fccid=c7414b794cb89c1c&vjs=3',
        'https://www.indeed.co.uk/rc/clk?jk=8c7f045caddb116d&fccid=473601b0f30a6c9c&vjs=3',
        'https://www.indeed.co.uk/company/Agilysts-Limited/jobs/Back-End-Java-Developer-3ec6efc3ebc256c5?fccid=d1f7896a8bd9f15e&vjs=3'
    ]

    def parse(self, response):
        for title in response.xpath('//div[@class="jobsearch-JobMetadataFooter"]'):
            item = {}
            # redirect_urls lists every URL the request passed through; its first
            # element is the original URL, so fall back to response.request.url
            # only when no redirect happened
            redirect_urls = response.request.meta.get('redirect_urls')
            item['_pageURL'] = redirect_urls[0] if redirect_urls else response.request.url
            item['description'] = ' '.join(title.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item['role_title_link'] = title.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            yield item
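If you want to check the output quickly, you can run the spider standalone and dump the items, for example with scrapy runspider myspider.py -o items.json (assuming the code above is saved as myspider.py); the _pageURL field should then contain the pre-redirect URLs from your spreadsheet.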
Also, note that there are some other issues with the original code you provided, specifically:

- in the parse method, you are returning items, which is a list, but only a dict is allowed (or an Item, or a Request) (see the sketch below)
- for titles in titles: probably does something you didn't intend, since it rebinds the name titles to each selector in turn
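For reference, here is a minimal sketch of that loop with both points addressed; it keeps your ICcom5Item (assuming it defines the three fields used in the question) and yields each item instead of collecting them into a list:

    def parse(self, response):
        titles = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]')
        for title in titles:  # a distinct loop variable, so titles is not rebound
            item = ICcom5Item()
            item['_pageURL'] = response.request.url
            item['description'] = ' '.join(title.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item['role_title_link'] = title.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            yield item  # yield one item at a time instead of returning a list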