python web-scraping scrapy scrapy-splash

scrapy response is nothing like the page source

I am trying to use scrapy shell to fetch "ykc1.greatwestlife.com", which should be a public website. Although there is a lot of content when I view the page source manually, I cannot get a correct response using scrapy.

[Screenshot: scrapy shell response result]

Do I need to use scrapy-splash in this case? Any ideas? Thanks.

Solution

You can actually see two back-to-back requests, caused by this script at the top of the page:

      <head>
        <script language="javascript">
            document.cookie = "cmsUserPortalLocale=en;path=/";
            document.cookie = "cmsTheme=advgwl;path=/";    
            document.cookie = "siteBrand="+escape(location.hostname)+"; path=/";
            window.location.reload(true);
        </script>

where the first response is substantially smaller, which is likely what you're seeing in the shell. Thankfully, since the cookies appear to be static, you can reproduce that behavior quite easily:

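def parse(self, response):
    # The response that arrives in parse() already has the session cookies,
    # but the page's inline script would set three more before reloading,
    # so add them here.
    new_cookies = {
        "cmsUserPortalLocale": "en",
        "cmsTheme": "advgwl",
        "siteBrand": "ykc1.greatwestlife.com",
    }
    # Re-request the same URL (response.url, not request.url);
    # dont_filter=True keeps the duplicate filter from dropping it.
    yield response.follow(url=response.url, cookies=new_cookies,
                          callback=self.parse_home, dont_filter=True)

def parse_home(self, response):
    # and now you have the full body
    self.logger.info("Got %d bytes", len(response.body))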
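
If you would rather not hard-code the hostname, the same three cookies can be derived from the URL, mirroring what the inline script does with `location.hostname`. This is only a sketch using the standard library: `cookies_for` is a hypothetical helper name, and `quote(..., safe="@*+/")` is an approximation of JavaScript's `escape()` that leaves ordinary hostnames unchanged.

```python
from urllib.parse import quote, urlsplit


def cookies_for(url):
    """Build the three cookies the page's inline script would set.

    Hypothetical helper, not part of Scrapy.
    """
    # urlsplit(...).hostname is lowercased and has no port, like location.hostname
    hostname = urlsplit(url).hostname or ""
    return {
        "cmsUserPortalLocale": "en",
        "cmsTheme": "advgwl",
        # Approximates escape(); a no-op for plain hostnames
        "siteBrand": quote(hostname, safe="@*+/"),
    }


print(cookies_for("https://ykc1.greatwestlife.com/"))
```

Inside the spider you could then pass `cookies=cookies_for(response.url)` instead of the literal dictionary.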