Tags: python, xpath, scrapy, web-crawler, scrapy-splash

Scraping images in a dynamic, JavaScript webpage using Scrapy and Splash


I am trying to scrape the link to a hi-res image from this listing, but the high-res version of the image only appears after clicking the mid-sized image on the page, i.e. after clicking "Click here to enlarge the image" (on the page, it's in Turkish).
After that I can inspect it with Chrome's Developer Tools and get the XPath/CSS selector. Everything is fine up to this point.

However, on a JavaScript-rendered page you can't just call response.xpath("//blah/blah/@src") and get the data. So I installed Splash (via docker pull) and configured my Scrapy settings.py etc. to make it work (this YouTube tutorial helped; no need to visit the link unless you want to learn how to do it)... and it worked on other JavaScript pages!
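For reference, the Splash wiring in settings.py follows the standard scrapy-splash setup; a minimal sketch (the SPLASH_URL assumes Splash is running locally in Docker on its default port):

```python
# settings.py -- minimal scrapy-splash configuration sketch,
# per the scrapy-splash README. SPLASH_URL assumes a local
# Docker container on the default port 8050.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```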

Just... I cannot get past this "Click here to enlarge the image!" step. The response for the enlarged image is always null.

This is my code:

import scrapy
#import json
from scrapy_splash import SplashRequest

class TryMe(scrapy.Spider):
    name = 'try_me'
    allowed_domains = ['arabam.com']

    def start_requests(self):
        start_urls = ["https://www.arabam.com/ilan/sahibinden-satilik-hyundai-accent/bayramda-arabasiz-kalmaa/17753653",
        ]

        for url in start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}},
            )
            # yield SplashRequest(url=url, callback=self.parse)  # this works too

    def parse(self, response):
        # I can get this one's link successfully since it is not rendered by JS:
        # IMG_LINKS = response.xpath('//*[@id="js-hook-for-ing-credit"]/div/div/a/img/@src').get()
        # ...but this one just doesn't work:
        IMG_LINKS = response.xpath("/html/body/div[7]/div/div[1]/div[1]/div/img/@src").get()
        print(IMG_LINKS)  # prints None :(
        yield {"img_links": IMG_LINKS}  # yields the item: img_links: null

The shell command I'm using:
scrapy crawl try_me -O random_filename.jl

XPath of the link I'm trying to scrape:
/html/body/div[7]/div/div[1]/div[1]/div/img

[Screenshot of this XPath/link in Developer Tools]

I can actually see the link I want in the Network tab of my Developer Tools window when I click to enlarge the image, but I don't know how to scrape that link with Scrapy.

Possible solution: I could also grab the whole garbled body of my response, i.e. response.text, and apply a regular expression to it (e.g. match everything that starts with https:// and ends with .jpg). That would definitely be looking for a needle in a haystack, but it sounds quite practical as well.
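A minimal sketch of that regex fallback, run here against a stand-in HTML string (the sample URLs are made up; in the spider you'd pass response.text instead):

```python
import re

# Stand-in for response.text; the URLs below are illustrative only.
html = ('<img src="https://arbstorage.mncdn.com/a/b_1920x1080.jpg">'
        ' <script>var x = "https://example.com/thumb.jpg";</script>')

# Pull every https://....jpg URL out of the raw body.
img_links = re.findall(r'https://[^\s"\'<>]+\.jpg', html)
print(img_links)
```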

Thanks!


Solution

  • As far as I understand, you want to find the main image link. I checked the page; it is inside one of the meta elements:

    <meta itemprop="image" content="https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg">
    
    

    Which you can get with

    >>> response.css('meta[itemprop=image]::attr(content)').get()
    'https://arbstorage.mncdn.com/ilanfotograflari/2021/06/23/17753653/3c57b95d-9e76-42fd-b418-f81d85389529_image_for_silan_17753653_1920x1080.jpg'
    

    You don't need to use Splash for this. When I check the website with Splash, arabam.com returns a permission-denied error, so I recommend not using Splash for this website.
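    If you want to sanity-check that extraction without running a spider, here is a stdlib-only sketch that mimics the selector; the HTML snippet and URL are stand-ins for the real page:

```python
from html.parser import HTMLParser

# Find the content attribute of <meta itemprop="image"> using only
# the standard library, mirroring the CSS selector shown above.
class MetaImageParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.image = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "meta" and d.get("itemprop") == "image":
            self.image = d.get("content")

# Stand-in for the real page source.
html = ('<html><head><meta itemprop="image" '
        'content="https://arbstorage.mncdn.com/x_1920x1080.jpg">'
        '</head></html>')
parser = MetaImageParser()
parser.feed(html)
print(parser.image)
```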

    For a better solution covering all the images, you can parse the JavaScript: the images array is loaded with JS right there in the page source.

    [Screenshot of the images array in the page source]

    To reach that JavaScript, try:

      response.css('script::text').getall()[14]
    

    This will give you the whole JavaScript string containing the images array (note that the hard-coded index [14] may change if the page layout changes). You can parse it with a library like js2xml (a third-party package, not built-in).

    Check out how to use it here: https://github.com/scrapinghub/js2xml. If you still have questions, feel free to ask. Good luck!
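    If you'd rather avoid an extra dependency, here is a stdlib-only sketch of the same idea, assuming the URLs sit in a JS array literal of plain strings; the variable name imagesData and the snippet are hypothetical stand-ins for the real inline script:

```python
import json
import re

# Hypothetical stand-in for response.css('script::text').getall()[14].
script = '''
var imagesData = ["https://arbstorage.mncdn.com/a_1920x1080.jpg",
                  "https://arbstorage.mncdn.com/b_1920x1080.jpg"];
'''

# Grab the array literal after the assignment and decode it as JSON.
match = re.search(r'imagesData\s*=\s*(\[.*?\])', script, re.DOTALL)
images = json.loads(match.group(1)) if match else []
print(images)
```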