Tags: python, lua, scrapy, scrapy-splash

How do you clear the scrapy-splash cache after running in a container?


*EDIT: This is not specific to Zyte; I have the same issue when running in a Docker container. I have to throw an exception to get the login to work, and I need it to work across multiple runs while keeping the container persistent.

I have a login function here that originally worked reliably, but it stopped working and I've been trying to figure out why. While trying to fix it, I wrote a new function that checks whether the login succeeded and, if not, repeats the login with a different Lua script. That didn't help: every new combination of Lua scripts failed to log in, even though each one should have worked.

def start_requests(self):
    login_url = "URL"
    yield SplashRequest(
        login_url,
        splash_headers={
            # used for initializing zyte splash
            "Authorization": basic_auth_header(self.settings["APIKEY"], "")
        },
        callback=self.start_scraping,
        endpoint="execute",
        cache_args=["lua_source"],
        args={
            "wait": random.uniform(2.1, 3.4),
            "lua_source": self.LUA_LOGIN_SOURCE,
            "url": login_url,
            "ua": self.random_user_agent,
        },
        meta={
            "max_retry_times": 0,
        },
    )

Then I mistakenly added exit() without a status code and it threw an "Unhandled error in Deferred". After that, start_scraping ran again, and this time it was logged in and working fine.
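For context on why exit() produces that message: the builtin simply raises SystemExit, and an exception that escapes a Scrapy callback with no errback shows up in the logs as "Unhandled error in Deferred", since callbacks run inside Twisted's Deferred machinery. A tiny standalone check of the first part of that claim:

# exit() without arguments just raises SystemExit(None); Scrapy/Twisted then
# reports the escaped exception as "Unhandled error in Deferred".
try:
    exit()
except SystemExit as e:
    print("exit() raised SystemExit with code:", e.code)  # -> None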

I'm guessing this has something to do with caching in Zyte or Scrapy, or maybe with the way I'm storing cookies. The cookies expired at the end of the last session, but I don't think they should persist in the Zyte instance anyway.

Would throwing an error reset scrapy-splash in Zyte, and is that why it fixes the problem? I need to figure out why this is happening so I don't have to throw an unhandled error at the start of every job, hahaha.
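For what it's worth, scrapy-splash keeps its cookie jars inside the Scrapy process, keyed by SplashRequest's session_id, rather than inside the Splash container. A sketch of giving the login request a fresh jar and opting it out of Scrapy's HTTP cache (spider attributes like LUA_LOGIN_SOURCE are the same names as above; treat the exact wiring as an assumption about the setup):

import random
import uuid

from scrapy_splash import SplashRequest
from w3lib.http import basic_auth_header

def start_requests(self):
    login_url = "URL"
    yield SplashRequest(
        login_url,
        splash_headers={
            "Authorization": basic_auth_header(self.settings["APIKEY"], "")
        },
        callback=self.start_scraping,
        endpoint="execute",
        # a unique session_id makes SplashCookiesMiddleware use a fresh,
        # empty cookie jar for this run instead of a previous run's jar
        session_id=str(uuid.uuid4()),
        args={
            "wait": random.uniform(2.1, 3.4),
            "lua_source": self.LUA_LOGIN_SOURCE,
            "url": login_url,
            "ua": self.random_user_agent,
        },
        # dont_cache opts this request out of Scrapy's HTTP cache
        # (only relevant if HTTPCACHE_ENABLED is on)
        meta={"dont_cache": True, "max_retry_times": 0},
    )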

Here is the QUICK AND DIRTY check/retry function:

def start_scraping(self, response):
    if response.xpath("/html/body/h1//text()").extract_first() != "Forbidden":
        while self.login_counter < 4:
            self.login_counter += 1

            if self.login_counter == 1:
                lua_source = self.LUA_LOGIN_SOURCE
            # if the login function fails on attempt 3,
            # then try window-not-selected lua version
            elif self.login_counter == 2:
                lua_source = self.LUA_LOGIN2_TAB_SOURCE
            # try unhandled error (logs in successfully)
            elif self.login_counter == 3:
                exit()
            login_url = "URL"
            yield SplashRequest(
                login_url,
                splash_headers={
                    # used for initializing zyte splash
                    "Authorization": basic_auth_header(self.settings["APIKEY"], "")
                },
                callback=self.start_scraping,
                endpoint="execute",
                cache_args=["lua_source"],
                args={
                    "wait": random.uniform(4.1, 5.4),
                    "lua_source": lua_source,
                    "url": login_url,
                    "ua": self.random_user_agent,
                },
                meta={
                    "max_retry_times": 0,
                },
            )
    else:
        print("Logged in!")

An example of the working Lua script:

function main(splash, args)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(1))
  
  -- splash:send_keys("<Tab>")
  splash:send_text('<USERNAME>')
  assert(splash:wait(0.5))
  splash:send_keys("<Tab>")
  assert(splash:wait(0.5))
  splash:send_text('<PASSWORD>')
  assert(splash:wait(0.5))
  splash:send_keys("<Return>")
  assert(splash:wait(2))
  
  return {
    html = splash:html(),
    har = splash:har(),
    png = splash:png(),
    cookies = splash:get_cookies(),
  }
end
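For context on where that return table ends up: with endpoint="execute", scrapy-splash decodes the JSON response body into response.data, so each key returned by the Lua script is available in the Python callback. A minimal sketch (the callback name is illustrative):

import base64

def parse_after_login(self, response):
    page_html = response.data["html"]   # from splash:html()
    cookies = response.data["cookies"]  # HAR-style list from splash:get_cookies()
    # binary fields such as the png come back base64-encoded in the JSON
    screenshot = base64.b64decode(response.data["png"])
    self.logger.info("Got %d cookies after login", len(cookies))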

*EDIT: I am also storing cookies in a dictionary, but I don't know if it's relevant:

cookies_dict = {
    cookie["name"]: cookie["value"] for cookie in response.data["cookies"]
}
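One thing to watch with that dict: splash:init_cookies() expects the HAR-style list of cookie tables that splash:get_cookies() returns, not a flattened {name: value} mapping, so this dict can't be fed back to Splash directly. Forwarding the cookies by hand would look something like the sketch below (next_url, LUA_PAGE_SOURCE, and parse_page are hypothetical names; in a standard setup scrapy-splash's SplashCookiesMiddleware does this round-trip automatically):

from scrapy_splash import SplashRequest

def after_login(self, response):
    raw_cookies = response.data["cookies"]  # keep the full HAR-style list
    next_url = "URL"                        # hypothetical next page
    yield SplashRequest(
        next_url,
        callback=self.parse_page,           # hypothetical callback
        endpoint="execute",
        args={
            "lua_source": self.LUA_PAGE_SOURCE,  # hypothetical page script
            "url": next_url,
            "cookies": raw_cookies,  # read in Lua via splash.args.cookies
        },
    )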

Solution

  • The issue was that my Lua script called splash:init_cookies(splash.args.cookies) in the request issued from start_requests. By default, scrapy-splash stores cookies in a cookie jar on the Scrapy side and passes them to the script as splash.args.cookies; calling splash:init_cookies replaces the browser's cookies with the jar's contents. Since the jar was empty on the first request, the login ran with no cookies and failed. And since the failed request's cookies were never added to the jar, every retry repeated the same error.

    By throwing an unhandled error, I guess the cookie jar contents got updated without the Lua script replacing them, so on the next loop it was able to initialize the correct cookies from the jar and run successfully.

    Removing splash:init_cookies(splash.args.cookies) from the initial request solved the problem. Keeping the init call in the page-extraction Lua script (not the login script) continued to transfer the login cookies to each new page.
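A sketch of that corrected split, with both Lua scripts embedded as Python strings (LUA_PAGE_SOURCE is an illustrative name for the extraction script; the credential-typing steps are elided as in the script above):

# Sketch of the corrected split (assumes the standard scrapy-splash setup with
# SplashCookiesMiddleware enabled). The login script does NOT call
# splash:init_cookies -- it starts from a clean browser and just returns the
# cookies it earned. The page script DOES call it, so the session cookies the
# middleware collected are injected into every subsequent page load.
LUA_LOGIN_SOURCE = """
function main(splash, args)
  -- no splash:init_cookies() here: the jar is empty on the first request
  assert(splash:go(args.url))
  assert(splash:wait(1))
  -- ...type the credentials as in the script above...
  return {html = splash:html(), cookies = splash:get_cookies()}
end
"""

LUA_PAGE_SOURCE = """
function main(splash, args)
  -- init_cookies is what carries the login session into this page
  splash:init_cookies(splash.args.cookies)
  assert(splash:go(args.url))
  assert(splash:wait(1))
  return {html = splash:html(), cookies = splash:get_cookies()}
end
"""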