python, web-scraping, scrapy, scrapy-splash

scrapy response is nothing like the page source


I am trying to use scrapy shell to fetch "ykc1.greatwestlife.com", which should be a public website. When I view the page source manually there is a lot of content, but I cannot get the same response using scrapy.

[Screenshot: scrapy shell response result]

Do I need to use scrapy-splash in this case? Any ideas? Thanks.


Solution

  • You can actually see two back-to-back requests for the page, caused by this script at the top of its <head>:

          <head>
            <script language="javascript">
                document.cookie = "cmsUserPortalLocale=en;path=/";
                document.cookie = "cmsTheme=advgwl;path=/";    
                document.cookie = "siteBrand="+escape(location.hostname)+"; path=/";
                window.location.reload(true);
            </script>
    

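    [Screenshot: the back-to-back requests]

    The first response is substantially smaller, and it is likely the one you are seeing in scrapy shell: the script sets three cookies and reloads the page, and only the reloaded request gets the full content. Scrapy does not run JavaScript, so it never performs that reload. Thankfully, since the cookie values appear to be static, you can reproduce that behavior quite easily by sending the cookies yourself: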

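    def parse(self, response):
        # The response that arrives in parse() carries only the session
        # cookies; add the three cookies the inline script would have set.
        new_cookies = {
          "cmsUserPortalLocale": "en",
          "cmsTheme": "advgwl",
          "siteBrand": "ykc1.greatwestlife.com",
        }
        # Request the same URL again, this time with the extra cookies.
        # dont_filter=True keeps the duplicate filter from dropping the repeat.
        yield response.follow(url=response.url, cookies=new_cookies,
                              callback=self.parse_home, dont_filter=True)

    def parse_home(self, response):
        # and now you have the full body
        self.logger.info("Full page: %d bytes", len(response.body))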
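
If you would rather not hard-code the hostname, the cookie dict can be derived from the page URL, mirroring what the inline script does with `escape(location.hostname)`. The following is only a sketch: `build_portal_cookies` is a hypothetical helper name, the `https://` scheme in the example is an assumption, and `quote()` with that safe set is an approximation of JavaScript's `escape()` that should be good enough for hostnames.

```python
from urllib.parse import quote, urlsplit


def build_portal_cookies(url, locale="en", theme="advgwl"):
    """Return the three cookies the inline script sets for the given page URL.

    Hypothetical helper; the default locale and theme are the static values
    seen in the page's script.
    """
    hostname = urlsplit(url).hostname or ""
    return {
        "cmsUserPortalLocale": locale,
        "cmsTheme": theme,
        # JavaScript's escape() leaves letters, digits and @*_+-./ untouched;
        # quote() with the same safe set approximates it for hostnames.
        "siteBrand": quote(hostname, safe="@*_+-./"),
    }


# The URL scheme here is an assumption; use whatever the site actually serves.
cookies = build_portal_cookies("https://ykc1.greatwestlife.com/")
print(cookies["siteBrand"])  # ykc1.greatwestlife.com
```

Inside the spider you could then pass `cookies=build_portal_cookies(response.url)` to `response.follow` instead of the literal dict.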