xpath, web-scraping, scrapy, scrapy-shell

web-crawling - get item-title from bandcamp.com


I am trying to get the item-title of the new releases shown in the 'Discover' section of bandcamp.com (rock -> all rock -> new arrivals):

scrapy shell 'https://bandcamp.com/?g=rock&s=new&p=0&gn=0&f=all&w=0'

Part of the relevant source code of the page looks like this:

<div class="col col-3-12 discover-item">
            <a data-bind="click: playMe, css: { 'playing': playing }" class="item-link playable">
                <span class="item-img ratio-1-1">
                    <img class="art" data-bind="src_art: { 'art_id': artId, 'format': 'art_tags_large' }" src="https://f4.bcbits.com/img/a1631562669_9.jpg">
                    <span class="plb-btn">
                        <span class="plb-bg"></span>
                        <span class="plb-ic"></span>
                    </span>
                </span>
                </a><a data-bind="attr: { 'href': itemURL }, text: title, click: playMe" class="item-title" href="https://reddieseloff.bandcamp.com/album/dead-rebel?from=discover-new">Dead Rebel</a>
                <a data-bind="attr: { 'href': bandURL }, text: artist, click: playMe" class="item-artist" href="https://reddieseloff.bandcamp.com?from=discover-new">Red Diesel</a>
                <span class="item-genre" data-bind="text: genre">rock</span>

        </div>

I tried to get the text of item-title (in this example 'Dead Rebel') with XPath:

 response.xpath('//div[@class="col col-3-12 discover-item"]//a[@class="item-title"]/text()').extract()

but it returns an empty list:

 []

It also doesn't work for 'item-artist', so I wonder what I'm doing wrong.

I appreciate any help.


Solution

  • All of the data you are looking for is stored in a hidden div node inside the page body.
    When your browser loads the page, JavaScript unpacks and displays this data. Since Scrapy does not run any JavaScript, the markup you see in the browser is not in the response Scrapy receives, and you need to do the unpacking step yourself:

     import json

     # all of the data is in the data-blob attribute of <div id="pagedata">
     blob = response.css('div#pagedata::attr(data-blob)').extract_first()
     data = json.loads(blob)
     # dig through this Python dictionary to find your data
     # (it has pretty much everything, even more than the page displays)
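    The answer does not say how the decoded dictionary is laid out, so one way to "dig through" it is a small recursive search that collects every value stored under a given key, wherever it is nested. The sketch below uses only the standard library. The sample blob and the key names "title" and "artist" are made up for illustration (they mirror the data-bind names in the question's HTML), and the helper find_key is hypothetical, so check the real blob for the actual keys.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may appear deeper in the tree.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Made-up stand-in for the decoded data-blob; the real layout and key names may differ.
sample_blob = (
    '{"discover": {"items": ['
    '{"title": "Dead Rebel", "artist": "Red Diesel", "genre": "rock"}'
    ']}}'
)

data = json.loads(sample_blob)
titles = list(find_key(data, "title"))
artists = list(find_key(data, "artist"))
print(titles, artists)
```

    In the Scrapy shell you would pass the dictionary returned by json.loads in place of the sample. Printing sorted(data.keys()) first is a quick way to see which top-level sections the blob contains.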