Tags: python, scrapy, scrapy-splash

scrapy.FormRequest.from_response VS. SplashFormRequest.from_response


I'm trying to log in using Scrapy-Splash in exactly the same way as with plain Scrapy. I've looked at the documentation, Doc, which says "SplashFormRequest.from_response is also supported, and works as described in scrapy documentation". However, simply changing one line of code and updating the settings as described in the Splash documentation doesn't bring any results. What am I doing wrong? Code:

import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest

class MySpider(scrapy.Spider):
    name = 'lost'
    start_urls = ["myurl",]

    def parse(self, response):
        return SplashFormRequest.from_response(
            response,
            formdata={'username': 'pass', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        print response.body
        if "keyword" in response.body:
            self.logger.error("Success")
        else:
            self.logger.error("Failed")

Added to settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Error log:

python@debian:~/Python/code/lostfilm$ scrapy crawl lost
2017-01-26 20:24:22 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: lostfilm)
2017-01-26 20:24:22 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'lostfilm.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['lostfilm.spiders'], 'BOT_NAME': 'lostfilm', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'}
2017-01-26 20:24:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2017-01-26 20:24:22 [twisted] CRITICAL: Unhandled error in Deferred:

2017-01-26 20:24:22 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 72, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 97, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 49, in load_object
    raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'scrapy.downloadermiddlewares.httpcompression' doesn't define any object named 'HttpCompresionMiddlerware'
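The NameError comes from Scrapy's load_object helper (visible in the last frame of the traceback): it resolves each dotted path in DOWNLOADER_MIDDLEWARES to a Python object, and the misspelled name 'HttpCompresionMiddlerware' cannot be found in the module. A simplified stand-in (not Scrapy's actual source) shows why a typo in a settings entry fails this way:

```python
import importlib

def load_object(path):
    # Simplified stand-in for scrapy.utils.misc.load_object:
    # split the dotted path, import the module, look up the attribute.
    module_path, _, name = path.rpartition('.')
    module = importlib.import_module(module_path)
    try:
        return getattr(module, name)
    except AttributeError:
        raise NameError("Module '%s' doesn't define any object named '%s'"
                        % (module_path, name))

# A correct dotted path resolves; a misspelled final component raises
# the same kind of NameError seen in the log above.
load_object('os.path.join')      # resolves fine
try:
    load_object('os.path.joyn')  # typo in the object name
except NameError as e:
    print(e)
```

So the traceback means the settings file actually being loaded contains the misspelled class name, and the crawler aborts before any request is made.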

Solution

  • First, read the traceback: Scrapy fails to load an object named 'HttpCompresionMiddlerware', which means the settings file actually in use misspells 'HttpCompressionMiddleware'. Correct that spelling in DOWNLOADER_MIDDLEWARES, or the crawler will never start.

  • You probably also need to perform the first request with Splash.

    By default, the start_urls attribute issues plain scrapy.Request objects, not SplashRequest, so the login page itself is fetched without going through Splash.

    Override the start_requests method in your spider:

    class MySpider(scrapy.Spider):
        name = 'lost'
        start_urls = ["myurl",]
    
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url)
        ...
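The effect of the override can be illustrated without Scrapy at all. Below is a toy sketch in which Request, SplashRequest, and Spider are minimal stand-ins for the real scrapy / scrapy_splash classes, not their actual implementations; it shows that the inherited start_requests yields plain requests, while the override makes the very first fetch a Splash one:

```python
class Request(object):
    """Stand-in for scrapy.Request."""
    def __init__(self, url):
        self.url = url

class SplashRequest(Request):
    """Stand-in for scrapy_splash.SplashRequest."""
    pass

class Spider(object):
    """Stand-in for scrapy.Spider's default behavior."""
    start_urls = []

    def start_requests(self):
        # Default: plain Requests, which bypass Splash entirely.
        for url in self.start_urls:
            yield Request(url)

class MySpider(Spider):
    start_urls = ["myurl"]

    def start_requests(self):
        # Override: the first request now goes through Splash too.
        for url in self.start_urls:
            yield SplashRequest(url)

requests = list(MySpider().start_requests())
print(all(isinstance(r, SplashRequest) for r in requests))  # -> True
```

With the real classes, that SplashRequest is what routes the login page through the Splash endpoint configured by SPLASH_URL, so the response handed to parse (and to SplashFormRequest.from_response) is the rendered page.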