python xpath scrapy enumerate

Trouble crawling game stores using Scrapy - HTML changes if there is discount & dealing with null


I am trying to use Scrapy to crawl a series of game stores and have hit the same issue with all of them. I am using XPath, and the HTML for a game's price differs depending on whether the price is shown simply as £ 20.09, or as £ 20.09 struck through followed by £ 14.49 to show a discount. I am happy to have two columns, a "was" column (20.09), which would contain null values for games without a discount, and a "now" column (14.49), but I can't figure out how to get a null value where a price is missing instead of having all the following prices shift up.

This is my code for the website cdkeys (https://www.cdkeys.com/pc/games?limit=50), which has games both with and without discounts.

allowed_domains = ['cdkeys.com']  # Scrapy reads allowed_domains (bare domain names); allowed_urls is ignored
start_urls = ['https://www.cdkeys.com/pc/games/{pageno}?limit=50'.format(pageno=pageno)
    for pageno in range(1, 10)]

def parse(self, response):
    Games = response.xpath('//*[@id="root-wrapper"]/div/div[1]/div[2]/div[3]/div[2]/div[2]/ul/li/h2/a/text()').extract()
    Prices = response.xpath('//span[starts-with(@id, "product-price-")]/span[1]/span/text()').extract()
    for i, (Game, Price) in enumerate(zip(Games, Prices)):
        yield {'index': i, 'Game': Game, 'Price':Price}

The issue lies with the XPath for prices: I can get either a list of only the discounted prices or a list of prices only for games with no discount, since the HTML is quite different for the two cases.

What's stopping me from simply creating two lists is that, since I'm using zip and enumerate, the loop just pairs the first x games with prices until it runs out of prices, instead of linking each game to its corresponding price.
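
To make the displacement concrete, here is a minimal sketch with made-up data (the game names and prices are hypothetical, not taken from the site). It only illustrates how zip pairs by position, and how keeping one record per product would leave a gap instead:

```python
# Hypothetical data: three games, but the price selector only matched two,
# because the middle game's price sits under different markup.
games = ["Game A", "Game B", "Game C"]
prices = ["14.49", "9.99"]  # Game B's price was not matched

# zip pairs purely by position and stops at the shorter list,
# so Game B is given Game C's price and Game C is dropped entirely.
paired = list(zip(games, prices))
print(paired)

# If each product carries its own (possibly missing) price,
# the pairing survives and the gap shows up as None.
products = [
    {"name": "Game A", "price": "14.49"},
    {"name": "Game B", "price": None},
    {"name": "Game C", "price": "9.99"},
]
rows = [(p["name"], p["price"]) for p in products]
print(rows)
```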

Any help with either getting only the correct price into Prices, or finding a way to have empty values instead of displacing the following ones, would be very much appreciated. I'm new to both Python and web crawling and am just trying to get my head around all this.


Solution

  • I would do it differently and iterate over the product items one by one, locating the game name, regular price and discount price within each item:

    def parse(self, response):
        for game in response.css("ul.products-grid li.item"):
            name = game.css("h2.product-name > a::text").extract_first()
            old_price = game.css(".regular-price .price::text,.old-price .price::text").extract_first()
            discount_price = game.css(".special-price .price::text").extract_first()
    
            yield {
                "name": name,
                "old_price": old_price,
                "discount_price": discount_price
            }
    

    For the first page, you would get the following output:

    {'old_price': u'$ 13.59', 'name': u'Stellaris: Utopia PC DLC', 'discount_price': None}
    {'old_price': u' $ 9.49 ', 'name': u'Insurgency PC', 'discount_price': u' $ 1.99 '}
    ...
    {'old_price': u' $ 81.59 ', 'name': u'Call of Duty Black Ops II 2 Digital Deluxe Edition PC ', 'discount_price': u' $ 13.59 '}
    

    Note that old_price is populated whether or not the game is discounted, while discount_price is None when there is no discount, so every game keeps its own row.
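
The prices in the output above still carry a currency symbol and stray whitespace. Below is a minimal sketch of one way to normalise them, assuming each price is a symbol followed by a plain decimal number with no thousands separators. clean_price is a hypothetical helper written for this illustration, not part of Scrapy:

```python
import re
from decimal import Decimal


def clean_price(raw):
    """Return the numeric part of a price string as a Decimal, or None.

    Assumes a plain decimal number such as ' $ 9.49 ' (no thousands
    separators); None and strings without digits both map to None.
    """
    if raw is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return Decimal(match.group()) if match else None


print(clean_price(" $ 9.49 "))
print(clean_price(None))
```

Inside parse, this could be applied to old_price and discount_price before yielding, so that a missing discount stays None rather than raising an error.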