python scrapy export-to-csv

Include original URL from Excel sheet in scrapy output


I'm using Scrapy to crawl some pages. I read the start_urls from an Excel sheet, and I want those exact start URLs to appear in the results rather than the URLs they redirect to. I need the originals so that I can look the results up in Excel afterwards.

The problem is that the output only ever contains the destination URL.

My code is as follows:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom5.items import ICcom5Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field
import requests
import csv
import sys

class MySpider(Spider):
    name = "ICcom5"
    start_urls = [l.strip() for l in open('items5.csv').readlines()]

    def parse(self, response):
        item = Item()
        titles = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]')
        items = []
        for titles in titles:
            item = ICcom5Item()
            home_url = ("http://www.indeed.co.uk")
            item ['_pageURL'] = response.request.url
            item ['description'] = ' '.join(titles.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item ['role_title_link'] = titles.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            items.append(item)
        return items

Pretty simple code, but I'm struggling to understand what I can do from the Scrapy docs.


I have modified the code according to the advice, but I'm still not getting the original URLs from my source spreadsheet. Example URLs are as follows:

https://www.indeed.co.uk/rc/clk?jk=a47eb72131f3d588&fccid=c7414b794cb89c1c&vjs=3
https://www.indeed.co.uk/rc/clk?jk=8c7f045caddb116d&fccid=473601b0f30a6c9c&vjs=3
https://www.indeed.co.uk/company/Agilysts-Limited/jobs/Back-End-Java-Developer-3ec6efc3ebc256c5?fccid=d1f7896a8bd9f15e&vjs=3

Solution

  • You can use response.request.url in the parse function to get the original URL you requested.

    UPDATE: Either I misunderstand the documentation or this is a bug. Specifically, the passage

    HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection)

    makes me really think that the original request URL should be available under response.request.url.

    Anyway, as stated in the RedirectMiddleware documentation, there's an alternative way. You can use the redirect_urls key of request.meta to get the list of URLs the request went through. Here's a modified (simplified) version of your code as a PoC:

    # -*- coding: utf-8 -*-
    import scrapy
    
    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = [
            'https://www.indeed.co.uk/rc/clk?jk=a47eb72131f3d588&fccid=c7414b794cb89c1c&vjs=3',
            'https://www.indeed.co.uk/rc/clk?jk=8c7f045caddb116d&fccid=473601b0f30a6c9c&vjs=3',
            'https://www.indeed.co.uk/company/Agilysts-Limited/jobs/Back-End-Java-Developer-3ec6efc3ebc256c5?fccid=d1f7896a8bd9f15e&vjs=3'
        ]
    
        def parse(self, response):
            for title in response.xpath('//div[@class="jobsearch-JobMetadataFooter"]'):
                item = {}
                redirect_urls = response.request.meta.get('redirect_urls')
                item['_pageURL'] = redirect_urls[0] if redirect_urls else response.request.url
                item['description'] = ' '.join(title.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
                item['role_title_link'] = title.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
                yield item
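
    The fallback logic in the PoC can be checked on its own, without running a crawl. The following is a minimal sketch: original_url is a hypothetical helper name, the example.com URLs are placeholders, and plain dicts stand in for response.request.meta.

```python
def original_url(meta, fallback_url):
    """Return the first URL in the redirect chain, or fallback_url if there is none."""
    # redirect_urls is only present when the request was actually redirected.
    redirect_urls = meta.get("redirect_urls")
    return redirect_urls[0] if redirect_urls else fallback_url


# Plain dicts stand in for response.request.meta (placeholder URLs).
redirected = {"redirect_urls": ["https://example.com/start", "https://example.com/hop"]}
not_redirected = {}

print(original_url(redirected, "https://example.com/final"))
print(original_url(not_redirected, "https://example.com/direct"))
```

    The first call prints the URL that was originally requested, the second falls back to the URL passed in, which is what the `_pageURL` line in the spider does with response.request.url.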
    

    Also, note that there are some other issues with the code in your question, specifically:

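    • in your parse method, you collect the items in a list and return it at the end. A callback may return any iterable of items (dict or Item) or Request objects, so this works, but yielding each item as it is built is simpler and avoids the extra list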
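    • the line for titles in titles: reuses the name of the list as the loop variable, so titles is rebound to each element in turn and no longer refers to the list after the loop; use a separate name such as title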
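
    The loop-variable point is easy to see in plain Python. This is a small sketch with made-up data; build_items and the "value" key are hypothetical names used only for illustration, not part of Scrapy.

```python
titles = ["first", "second", "third"]

# Reusing the list's name as the loop variable rebinds it on every pass.
for titles in titles:
    pass

print(titles)  # third (the last element, no longer the list)


def build_items(nodes, page_url):
    """Yield one dict per node, using a loop variable distinct from the list."""
    for node in nodes:
        yield {"_pageURL": page_url, "value": node}


items = list(build_items(["first", "second"], "https://example.com/start"))
print(len(items))  # 2
```

    A generator written this way keeps the list intact and hands back items one at a time, which is the same shape as the parse method in the PoC.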