Tags: python, web-scraping, scrapy, screen-scraping, scrapy-shell

I got TypeError when using Scrapy View


I am trying to use scrapy view https://www.example.com (not the real link, since my job does not allow me to disclose it; sorry) to debug the page, but I got this error:

2018-11-01 20:49:29 [twisted] CRITICAL: Unhandled error in Deferred:

2018-11-01 20:49:29 [twisted] CRITICAL:
Traceback (most recent call last):
  File "d:\kerja\hit\python projects\my_project\my_project-env\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "d:\kerja\hit\python projects\my_project\my_project-env\lib\site-packages\scrapy\crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "d:\kerja\hit\python projects\my_project\my_project-env\lib\site-packages\scrapy\crawler.py", line 79, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "d:\kerja\hit\python projects\my_project\my_project-env\lib\site-packages\scrapy\crawler.py", line 102, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "d:\kerja\hit\python projects\my_project\my_project-env\lib\site-packages\scrapy\spiders\__init__.py", line 51, in from_crawler
    spider = cls(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'start_requests'
'page' is not recognized as an internal or external command,
operable program or batch file.

How can I avoid this error?

UPDATE:

I get this error in one of my Scrapy projects, but I don't get any error in my other Scrapy project. It seems to be a problem with the spider.


Solution

    1.

    As mentioned by Elena in their answer, the sample command you gave wasn't quoted. You'll need to handle the & character properly (by quoting the URL, or at least escaping that character) so that the full URL reaches Scrapy as a single argument. That is also why the log ends with 'page' is not recognized as an internal or external command: the shell treated everything after the unescaped & as a separate command.

    While this quoting issue does need to be resolved, I don't think it's the cause of the TypeError you're currently getting.
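
As an illustration of the quoting issue (the URL below is a made-up placeholder), Python's shlex.quote shows what a safely quoted POSIX-shell command looks like; on Windows cmd.exe, wrapping the URL in double quotes serves the same purpose:

```python
import shlex

# Hypothetical URL with an "&" in the query string; left unquoted, cmd.exe
# or a POSIX shell would split the command line at the "&".
url = "https://www.example.com/search?q=foo&page=2"

# shlex.quote produces POSIX-shell quoting. On Windows cmd.exe, write
# double quotes by hand instead: scrapy view "https://...&page=2"
command = "scrapy view " + shlex.quote(url)
print(command)  # scrapy view 'https://www.example.com/search?q=foo&page=2'
```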

    2.

    When handling commands like scrapy fetch and scrapy view, Scrapy needs to initialize a scrapy.Spider instance for the task.

    During the process, Scrapy looks for a scrapy.cfg file at the current path, and:

    • Case A: If there is such a file, Scrapy recognizes the project at the current working path and tries to load an existing scrapy.Spider subclass from it.
    • Case B: If not, which means there is no Scrapy project available, Scrapy just initializes a default scrapy.Spider instance.

    According to the log you shared, it's case A you're having.
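
As a rough sketch (a simplified re-implementation for illustration only, not Scrapy's actual code, which also consults environment variables), the detection boils down to walking up from the current directory looking for scrapy.cfg:

```python
import tempfile
from pathlib import Path

def inside_scrapy_project(start=".") -> bool:
    """Return True if scrapy.cfg exists in `start` or any of its parents.

    Simplified illustration of how Scrapy decides between case A
    (inside a project) and case B (no project).
    """
    path = Path(start).resolve()
    return any((d / "scrapy.cfg").is_file() for d in (path, *path.parents))

# Quick demonstration in a throwaway directory tree:
with tempfile.TemporaryDirectory() as tmp:
    spiders = Path(tmp) / "my_project" / "spiders"
    spiders.mkdir(parents=True)
    print(inside_scrapy_project(spiders))   # no scrapy.cfg anywhere -> False
    (Path(tmp) / "my_project" / "scrapy.cfg").write_text("[settings]\n")
    print(inside_scrapy_project(spiders))   # found in a parent dir  -> True
```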

    What's more, when handling a scrapy fetch command, Scrapy tries to override the spider's start_requests attribute via spider arguments (related code here). And according to the log you shared, your spider's __init__ does not accept such a keyword argument.

    Thus you may try any of these approaches:

    • Proposal A: Change the working directory to somewhere else, where there's no Scrapy project (e.g. cd /tmp/). Then retry the same scrapy fetch command.
    • Proposal B: Properly handle the input arguments (example below), then retry the same scrapy fetch command.

    In either case, you might need to fix the scrapy fetch command as mentioned in #1.
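
The TypeError itself can be reproduced without Scrapy at all. The toy sketch below (plain Python, with made-up class names) mimics how an __init__ that doesn't forward keyword arguments rejects start_requests, while one that forwards **kwargs accepts it:

```python
class Base:
    """Stand-in for scrapy.Spider: stores extra kwargs as attributes."""
    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

class RigidSpider(Base):
    # Accepts no extra keyword arguments at all.
    def __init__(self):
        super().__init__()

class FlexibleSpider(Base):
    # Forwards everything to the base class.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

try:
    RigidSpider(start_requests=True)
except TypeError as exc:
    print("RigidSpider:", exc)   # -> TypeError, like in the question

spider = FlexibleSpider(start_requests=True)
print("FlexibleSpider:", spider.start_requests)  # -> True
```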

    3.

    Sample code for proposal B above:

    import scrapy


    class TestSpider(scrapy.Spider):
        name = 'test'

        # Give the custom arguments default values so the spider can still be
        # created when Scrapy passes none of them (e.g. for scrapy fetch or
        # scrapy view), and forward any remaining keyword arguments, such as
        # start_requests, to the base class.
        def __init__(self, argument_foo=None, argument_bar=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # handle your arguments "foo" and "bar" here
            # e.g. self.foo = int(argument_foo) if argument_foo is not None else None