Tags: python, web-scraping, scrapy, scrapy-splash

Access Denied: You don't have permission to access "http://www.airbnb.ca/rooms/48058366/" on this server


Is there any way to get around this error? I am using Splash to render the HTML, but the response.body that comes back is an access-denied page. I can view the data in Chrome developer tools, and when I use Splash on its own I see the full HTML, yet the spider cannot extract it because of this error. My code is on GitHub for anyone interested: https://github.com/ryanshrott/scraping/tree/master/demo_airbnb

Access Denied

You don't have permission to access "http://www.airbnb.ca/rooms/48058366/" on this server.

Reference #18.66cc94d1.1643648347.66b47664

import scrapy
from scrapy_splash import SplashRequest


class SimpleSpider(scrapy.Spider):
    name = 'simple'
    allowed_domains = ['airbnb.ca']

    script = '''function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(0.5))
            return {
                html = splash:html(),
            }
            end'''
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
           'AppleWebKit/537.36 (KHTML, like Gecko) '\
           'Chrome/75.0.3770.80 Safari/537.36'}

    def start_requests(self):
        yield SplashRequest(
            url='https://www.airbnb.ca/rooms/48058366/',
            callback=self.parse,
            args={"lua_source": self.script},
            headers=self.headers,
        )

    def parse(self, response):
        yield { 'body' : response.body,
            'title': response.xpath("//h2[@class='_14i3z6h']/text()").get()}

Solution

  • When using a Lua script, you need to send the request to the execute endpoint, as shown in the code below. Also, when using scrapy_splash, be sure to include the required values in the settings.py file or in the spider's custom_settings attribute, as I have done below:

    import json
    import scrapy
    from scrapy_splash import SplashRequest
    
    
    class SimpleSpider(scrapy.Spider):
        name = 'simple'
        allowed_domains = ['airbnb.ca']
    
        custom_settings = dict(
            SPLASH_URL = 'http://localhost:8050',
            DOWNLOADER_MIDDLEWARES = {
                'scrapy_splash.SplashCookiesMiddleware': 723,
                'scrapy_splash.SplashMiddleware': 725,
                'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
            },
            SPIDER_MIDDLEWARES = {
                'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
            },
            DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter',
        )
    
        script = '''function main(splash, args)
                assert(splash:go(args.url))
                assert(splash:wait(0.5))
                return splash:html()
                end'''
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
               'AppleWebKit/537.36 (KHTML, like Gecko) '\
               'Chrome/75.0.3770.80 Safari/537.36'}
    
        def start_requests(self):
            yield SplashRequest(
                url='https://www.airbnb.ca/rooms/48058366/',
                callback=self.parse,
                args={"lua_source": self.script},
                endpoint='execute',
                headers=self.headers,
            )
    
        def parse(self, response):
            data = response.xpath("//*[@id='data-deferred-state']/text()").get()
            yield json.loads(data)
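
    The parse callback above relies on Airbnb embedding its listing data as JSON inside a `<script id="data-deferred-state">` tag. A minimal sketch of that extraction step, using a made-up, simplified HTML snippet in place of the real rendered page (the key names and payload shape here are illustrative, not Airbnb's actual schema):

    ```python
    import json
    import re

    # Stand-in for the HTML that Splash would return; the real page
    # embeds a much larger JSON payload in this script tag.
    html = '''<html><body>
    <script id="data-deferred-state" type="application/json">
    {"niobeMinimalClientData": {"roomId": 48058366}}
    </script>
    </body></html>'''

    # Grab the text content of the script tag; the spider does the same
    # thing with the XPath //*[@id='data-deferred-state']/text().
    match = re.search(
        r'<script id="data-deferred-state"[^>]*>(.*?)</script>',
        html, re.DOTALL)
    data = json.loads(match.group(1))
    print(data["niobeMinimalClientData"]["roomId"])  # 48058366
    ```

    This is why the callback yields json.loads(data) rather than scraping individual elements: the structured payload is already present in the page source once it renders.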
    

    If you run the spider using scrapy crawl simple or scrapy runspider simple.py, you get below output

    Sample spider run
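
One practical note: the custom_settings above point SPLASH_URL at http://localhost:8050, so a Splash instance must be listening there before the spider runs. The official Docker image is a common way to start one (assuming Docker is installed):

```shell
# Start a local Splash instance on port 8050 (official image)
docker run -it -p 8050:8050 scrapinghub/splash
```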