scrapy · web-crawler · scrapy-splash

Following urls in javascript - Scrapy Splash


I am extremely new to web scraping. I have managed to extract information from static websites, but am now trying my hand at following URLs and extracting data (which, of course, involves some JavaScript). I have installed scrapy-splash for this, and it is running fine. The website I am trying to scrape is https://www.ta.com/portfolio/investments/ari-network-services-inc, and the button at the top right takes you to the next page (rendered by JavaScript, hence Splash). I want to scrape some basic data (company name, sectors, etc.) on all the pages up to the last one. This is what I have done so far, and I need help correcting it so it executes successfully.


import scrapy
from scrapy_splash import SplashRequest
import urllib.parse as urlparse


class TAFolio(scrapy.Spider):
    name = 'Portfolio'
    start_urls = ['https://www.ta.com/portfolio/investments/ari-network-services-inc']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback = self.parse, args={"wait" : 3})

    def parse(self, response):

        companyname = response.css('h1.item_detail-main-info-heading::text').extract_first()
        sectors = response.css('.item_detail-main-info-group-item::text')[0].extract()
        investmentyear = response.css('.item_detail-main-info-group-item::text')[1].extract()
        status = response.css('.item_detail-main-info-group-item::text')[2].extract()
        location = response.css('.item_detail-main-info-group-item::text')[3].extract()
        region = response.css('.item_detail-main-info-group-item::text')[4].extract()
        team = response.css('div.item_detail-main-info-group a::text').extract()

        yield {
            'companyname': companyname,
            'sectors': sectors,
            'investmentyear': investmentyear,
            'status': status,
            'location': location,
            'region': region,
            'team': team
        }

        next_page = response.css('li.item_detail-nav-item--next a::attr(href)').extract()


        if next_page is not None:
            yield SplashRequest(urlparse.urljoin('https://www.ta.com',next_page),callback=self.parse, args={"wait":3})

This gives me the correct information for the start_url but doesn't proceed to the next page.
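(A quick aside on why that `if next_page is not None` guard never stops the spider: `.extract()` returns a *list* of matches, and even an empty list is not `None`; also, `urljoin` expects a string, not a list. Using `.extract_first()`, which returns `None` when nothing matches, makes a plain truthiness check behave as intended. A minimal illustration in plain Python, no Scrapy required:)

```python
# .extract() yields a LIST of matches -- an empty list is still not None,
# so the "is not None" check above always passes.
no_matches = []                   # what .extract() gives when the selector finds nothing
print(no_matches is not None)     # True -- the spider would still try to follow it

# .extract_first() yields None when nothing matches, so a plain
# truthiness check does the right thing:
first_match = None                # what .extract_first() gives for no matches
print(bool(first_match))          # False -- pagination stops here
```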


Solution

  • Update: the problem was the order in which I scraped the site. Instead of hopping from one investment page to the next via the next-page button, the updated code below starts from the five portfolio category pages, collects every company link from each, and then scrapes each company page directly. This worked well.

    import scrapy
    from scrapy_splash import SplashRequest
    import urllib.parse as urlparse
    
    
    class TAFolio(scrapy.Spider):
        name = 'Portfolio'
        start_urls = [
            'https://www.ta.com/portfolio/business-services',
            'https://www.ta.com/portfolio/consumer',
            'https://www.ta.com/portfolio/financial-services',
            'https://www.ta.com/portfolio/healthcare',
            'https://www.ta.com/portfolio/technology'
        ]
    
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url=url, callback = self.parse, args={"wait" : 3})
    
        def parse(self, response):
    
            companylink = response.css('div.tiles.js-portfolio-tiles a::attr(href)').extract()
            for link in companylink:
                # response.follow resolves relative hrefs against the current page URL
                yield response.follow(link, callback=self.parse1)
    
        def parse1(self, response):
    
            companyname = response.css('h1.item_detail-main-info-heading::text').extract_first()
            sectors = response.css('.item_detail-main-info-group-item::text')[0].extract()
            investmentyear = response.css('.item_detail-main-info-group-item::text')[1].extract()
            status = response.css('.item_detail-main-info-group-item::text')[2].extract()
            location = response.css('.item_detail-main-info-group-item::text')[3].extract()
            region = response.css('.item_detail-main-info-group-item::text')[4].extract()
            team = response.css('div.item_detail-main-info-group a::text').extract()
            about_company = response.css('h2.item_detail-main-content-heading::text').extract()
            about_company_detail = response.css('div.markdown p::text').extract()
    
            yield {
                'companyname': companyname,
                'sectors': sectors,
                'investmentyear': investmentyear,
                'status': status,
                'location': location,
                'region': region,
                'team': team,
                'about_company': about_company,
                'about_company_detail': about_company_detail
            }
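For anyone reproducing this setup: `SplashRequest` only works once the Splash middlewares are wired into the project's `settings.py`. The asker already has this in place, but the standard configuration from the scrapy-splash README (assuming a Splash instance listening on localhost port 8050) looks like this:

```python
# settings.py -- standard scrapy-splash wiring (per the scrapy-splash README);
# assumes a Splash instance is running locally, e.g. via
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```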