I am trying to use scrapy shell to fetch "ykc1.greatwestlife.com", which should be a public website. Although the page source contains a lot of content when I view it manually, I cannot get a correct response using Scrapy.
Do I need to use scrapy-splash in this case? Any ideas? Thanks.
You can actually see the two back-to-back requests, caused by
<head>
<script language="javascript">
document.cookie = "cmsUserPortalLocale=en;path=/";
document.cookie = "cmsTheme=advgwl;path=/";
document.cookie = "siteBrand="+escape(location.hostname)+"; path=/";
window.location.reload(true);
</script>
where the first response is substantially smaller, and is likely what you are getting back in Scrapy. Thankfully, since the cookies appear to be static, you can reproduce that behavior quite easily:
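def parse(self, response):
    # The response that arrives at parse() has the session cookies,
    # but we need to add 3 more to them (the ones the page's script sets).
    new_cookies = {
        "cmsUserPortalLocale": "en",
        "cmsTheme": "advgwl",
        "siteBrand": "ykc1.greatwestlife.com",
    }
    # Re-request the same URL with the extra cookies. dont_filter=True
    # stops Scrapy's duplicate filter from dropping the repeated URL.
    yield response.follow(url=response.url, cookies=new_cookies,
                          callback=self.parse_home, dont_filter=True)

def parse_home(self, response):
    # and now you have the full body
    self.logger.info("Full page: %d bytes", len(response.body))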
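
If you would rather not hard-code the hostname, the cookie dict can be derived from the URL the same way the inline script does it. The following is only a rough sketch: `build_site_cookies` and `_JS_ESCAPE_SAFE` are hypothetical names, the `https://` scheme is an assumption, and `urllib.parse.quote` only approximates JavaScript's `escape()` (the two agree for ordinary hostnames).

```python
from urllib.parse import quote, urlsplit

# Characters JavaScript's escape() leaves alone, besides letters and digits.
# quote() is only an approximation of escape(); hostnames are unaffected.
_JS_ESCAPE_SAFE = "@*_+-./"


def build_site_cookies(url, locale="en", theme="advgwl"):
    """Return the three cookies the page's inline script would set for `url`.

    Hypothetical helper: the default values are copied from the script
    shown above and are assumed to be static.
    """
    # Like location.hostname, urlsplit().hostname excludes the port.
    hostname = urlsplit(url).hostname or ""
    return {
        "cmsUserPortalLocale": locale,
        "cmsTheme": theme,
        "siteBrand": quote(hostname, safe=_JS_ESCAPE_SAFE),
    }


# The scheme here is an assumption; use whatever URL the site actually serves.
cookies = build_site_cookies("https://ykc1.greatwestlife.com/")
print(cookies)
```

The resulting dict can be passed as the `cookies=` argument of the request, for example `build_site_cookies(response.url)` inside `parse()`.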