Tags: python, web-scraping, scrapy, xmlhttprequest

Using a Scrapy crawler to extract JSON data?


I'm trying to scrape product data that happens to be in an XHR request. I am able to scrape the desired data if I reference the XHR URL directly, but the site I am trying to scrape issues a different XHR request for each product page crawled.

Here is a product: https://www.midwayusa.com/product/939287480?pid=598174. I noticed that if you take the URL of each page and insert data into the path, e.g. https://www.midwayusa.com/productdata/939287480?pid=598174, you can get the XHR response that way. I don't know how to do that with a crawler, this being my second scraper and me being new to Python.
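
For illustration, what I'm picturing is just a rough sketch - I'm assuming the /productdata/ path always mirrors the /product/ path, and parse_json is a name I made up - something like this inside the spider class:

    # hypothetical sketch: rewrite each crawled product URL into its
    # productdata URL (relies on the scrapy/json imports in the spider file)
    def parse_item(self, response):
        # e.g. /product/939287480?pid=598174 -> /productdata/939287480?pid=598174
        json_url = response.url.replace('/product/', '/productdata/')
        yield scrapy.Request(json_url, callback=self.parse_json)

    def parse_json(self, response):
        # the productdata endpoint returns JSON, so the body parses directly
        data = json.loads(response.body)
        yield data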

So basically, what would be the easiest way to get the JSON data from each page crawled?

    import scrapy
    import json
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class PwspiderSpider(CrawlSpider):
        name = 'pwspider'
        allowed_domains = ['midwayusa.com']
        start_urls = ['https://www.midwayusa.com/s?searchTerm=backpack']

        # restricting CSS to the product links
        le_backpack_title = LinkExtractor(restrict_css='li.product')

        # callback to parse_item and follow the extracted product links
        rule_Backpack_follow = Rule(le_backpack_title, callback='parse_item', follow=False)

        # rules set so the bot can't leave the site
        rules = (
            rule_Backpack_follow,
        )

        def start_requests(self):
            yield scrapy.Request('https://www.midwayusa.com/s?searchTerm=backpack',
                                 meta={'playwright': True})

        def parse_item(self, response):
            data = json.loads(response.body)
            yield from data['products']



Solution

  • I tested the page: it uses JavaScript to generate the search results, but it doesn't get the data from another URL - it has all the information directly in the HTML as:

    <script> 
        window.icvData = {...} 
    </script>
    

    The same goes for the product pages - they also have the data directly in the HTML.

    Sometimes there may be an extra line with window.icvData.firstSaleItemId = ...,
    but I skip this information.
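
    As a side note: if that trailing line ever trips up json.loads, json.JSONDecoder().raw_decode from the standard library can parse just the first JSON value and ignore whatever follows. A minimal sketch, assuming script already holds the <script> text extracted as in the code below:

    # sketch: raw_decode stops after the first complete JSON value, so a
    # trailing line like `window.icvData.firstSaleItemId = ...` is ignored
    import json
    text = script.split("window.icvData = ")[-1].strip()
    data, end = json.JSONDecoder().raw_decode(text)

    Full spider code: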

    import scrapy
    import json
    from scrapy.spiders import Spider
    
    class PwspiderSpider(Spider):
    
        name = 'pwspider'
        
        allowed_domains = ['midwayusa.com']
        
        start_urls = ['https://www.midwayusa.com/s?searchTerm=backpack']
        
        
        def parse(self, response):
            print('url:', response.url)
            
            script = response.xpath('//script[contains(text(), "window.icvData")]/text()').get()
            #print(script)
            
            text = script.split("window.icvData = ")[-1].split('\n')[0].strip()
    
            try:
                data = json.loads(text)
            except Exception as ex:
                print('Exception:', ex)
                print(text)
                return
            
            #print(data["searchResult"].keys())
            
            products = data["searchResult"]['products']
            
            for item in products:
                #print(item)
                colors = [color['name'] for color in item['swatches']]
                print(item['description'], colors)
                yield response.follow(item['link'], callback=self.parse_product, cb_kwargs={'colors': colors})
            
        def parse_product(self, response, colors):
            print('url:', response.url)
            
            script = response.xpath('//script[contains(text(), "window.icvData")]/text()').get()
            #print(script)
            
            # I use `.split('\n')[0]` because sometimes there may be a second line with `window.icvData.firstSaleItemId = ...`
            text = script.split("window.icvData = ")[-1].split('\n')[0].strip()
            
            try:
                data = json.loads(text)
                data['colors'] = colors
            except Exception as ex:
                print('Exception:', ex)
                print(text)
                return
    
            yield data
    
    # --- run without project and save in `output.json` ---
    
    from scrapy.crawler import CrawlerProcess
    
    c = CrawlerProcess({
    #    'USER_AGENT': 'Mozilla/5.0',
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0',
        # save in file CSV, JSON or XML
        'FEEDS': {'output.json': {'format': 'json'}},  # new in 2.1
    })
    c.crawl(PwspiderSpider)
    c.start()
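
    Saved as a single file (for example pwspider.py - the filename is arbitrary), this runs without a Scrapy project: python pwspider.py starts the crawl and, through the FEEDS setting, writes the scraped items to output.json.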