web-scraping, scrapy, session-cookies, middleware, scrapy-splash

How to keep splash cookies


I am trying to scrape a website while staying logged in. I am using Splash with Scrapy because the site relies on JavaScript. Unfortunately, from what I understand, Splash resets cookies on every SplashRequest. My question is: how do I keep my cookies from being reset?

After scraping the web myself for a solution, I know it has something to do with Lua scripts or the cookie middleware, but I have no idea how to use either. If anyone could help, that would be great. The sites that discuss this are really unclear, so please be as clear as possible.


Solution

  • Yes, you can set and return cookies in Lua scripts. If the login page and the scraping page use the same script, it should look like this:

    function main(splash)
        splash:init_cookies(splash.args.cookies)
    
        -- ... your script
    
        return {
            cookies = splash:get_cookies(),
            -- ... other results, e.g. html
        }
    end
    

    If you use different scripts for login and scraping, you can return the cookies from login_script and send them along with the SplashRequest:

    # Pass cookies inside args: Splash exposes args to Lua as splash.args
    yield SplashRequest(url=url, callback=self.item_parse, endpoint='execute',
                        args={'lua_source': self.scrape_script,
                              'cookies': cookies})
    

    In scrape_script, load those cookies at the start of main with:

    splash:init_cookies(splash.args.cookies)
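
The two-script flow described above can be sketched in plain Python as follows. This is a minimal sketch, not the answerer's code: the helper names `cookies_from_login` and `build_scrape_args`, the `SCRAPE_SCRIPT` body, and the sample cookie are assumptions made for illustration. Only the `lua_source` and `cookies` keys, `splash:init_cookies`, and `splash:get_cookies` come from the answer itself.

```python
# Hypothetical Lua scrape script: restores the cookies it receives, loads the
# page, and returns both the HTML and the (possibly updated) cookies.
SCRAPE_SCRIPT = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    return {
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""


def cookies_from_login(login_data):
    """Extract the cookie list from the decoded result of the login script.

    `login_data` is assumed to be the table returned by the Lua login script,
    already decoded into a dict. Returns an empty list if no cookies came back.
    """
    return list(login_data.get("cookies") or [])


def build_scrape_args(lua_source, cookies):
    """Build the `args` dict for the follow-up request to the scrape script."""
    return {"lua_source": lua_source, "cookies": cookies}


# Made-up login result; the cookie name and value are illustrative only.
login_data = {
    "cookies": [
        {"name": "sessionid", "value": "abc123",
         "domain": "example.com", "path": "/"},
    ],
}

args = build_scrape_args(SCRAPE_SCRIPT, cookies_from_login(login_data))
print(args["cookies"][0]["name"])
```

In a spider's login callback, the same helpers would probably be used as `args=build_scrape_args(self.scrape_script, cookies_from_login(response.data))` inside the SplashRequest, assuming the login script returns a table containing a `cookies` field and that scrapy-splash exposes that table as `response.data`.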