Tags: python, scrapy, scrapy-splash

How to get dynamically loaded content from this website using scrapy-splash?


I'm trying to scrape this website with scrapy-splash, but I'm not able to extract any data. I want to get details about each real estate listing, such as its href and price. Here is my code:

In settings.py:

ROBOTSTXT_OBEY = False

USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"

SPLASH_ENABLED = True


DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050/'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

My spider:

import scrapy
from scrapy_splash import SplashRequest


class M2Spider(scrapy.Spider):

    name = "m2"
    allowed_domains = ['metrocuadrado.com']
    start_urls = [
        'https://www.metrocuadrado.com/bodega/arriendo'
    ]

    def start_requests(self):
        # Render each start URL through Splash, waiting 10 s for the JavaScript to run
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 10})

    def parse(self, response):
        print("--------------------------------------------------------------")
        real_states = response.selector.xpath(".//a[@class='sc-bdVaJa ebNrSm']").getall()

        # Print the extracted list itself, not the literal string "real_states"
        print(real_states)

The printed output is an empty list, []. I am new to Splash. Any suggestions?


Solution

  • What I would do instead is this:

    Send a request to https://www.metrocuadrado.com/results/_next/static/chunks/commons.8afec6af6d5add2097bf.js. In the response you'll find an API key if you search for "X-Api-Key". It can be extracted easily with a regex, something like: re.findall(r'"X-Api-Key":"(\w+)"', response.text).
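
As a minimal sketch of that extraction step, using only the standard library: the helper name extract_api_key and the sample JavaScript string are made up for illustration, and the pattern assumes the key consists only of word characters, as in the regex above.

```python
import re

# Pattern taken from the answer: the key is assumed to be a run of word characters.
API_KEY_PATTERN = re.compile(r'"X-Api-Key":"(\w+)"')


def extract_api_key(js_source):
    """Return the first X-Api-Key value found in js_source, or None if absent."""
    match = API_KEY_PATTERN.search(js_source)
    return match.group(1) if match else None


# Made-up snippet standing in for the downloaded JavaScript chunk.
sample_js = 'fetch(u,{headers:{"X-Api-Key":"abc123DEF","Accept":"application/json"}})'
print(extract_api_key(sample_js))
```

Returning None when nothing matches lets the caller stop early instead of sending a request with a missing key.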

    Then, once you've extracted the API key, send a request to https://www.metrocuadrado.com/rest-search/search?seo=/bodega/arriendo&from=0&size=50, which is the hidden API behind the page you linked. To get a valid response, you have to attach the key as a header, like this:

    scrapy.Request(
        url=url_variable,
        headers={
            "x-api-key": api_key_variable_from_prev_step
        }
    )
    

    From that API you get JSON-formatted data, which is usually more reliable than parsing the HTML, since the HTML changes more often.