
Scrapy: Running multiple spiders on Scrapyd - Python logic error


Scrapy 1.4

I am using this script (from "Run multiple scrapy spiders at once using scrapyd") to schedule multiple spiders on Scrapyd. Previously I was using Scrapy 0.19 and it ran fine.

I am receiving the error: TypeError: create_crawler() takes exactly 2 arguments (1 given)

So now I don't know whether the problem is the Scrapy version or a simple Python logic error (I am new to Python).

I made some modifications so that the script first checks whether the spider is active in the database.

class AllCrawlCommand(ScrapyCommand):

    requires_project = True
    default_settings = {'LOG_ENABLED': False}

    def short_desc(self):
        return "Schedule a run for all available spiders"

    def run(self, args, opts):

        cursor = get_db_connection()
        cursor.execute("SELECT * FROM lojas WHERE disponivel = 'S'")
        rows = cursor.fetchall()

        # Put every site domain into a list.
        # Further down, a check makes sure that only the available spiders
        # whose name matches a site domain are scheduled.
        sites = []
        for row in rows:
            site = row[2]
            print site

            # add each site to the list
            sites.append(site)

        url = 'http://localhost:6800/schedule.json'
        crawler = self.crawler_process.create_crawler()
        crawler.spiders.list()
        for s in crawler.spiders.list():
            #print s
            if s in sites:

                values = {'project' : 'esportifique', 'spider' : s}
                r = requests.post(url, data=values)
                print(r.text)

Solution

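  • Based on the link parik suggested, here's what I did. In Scrapy 1.4, create_crawler() requires a spider class or spider name as its argument, which is why the no-argument call from the older script fails; the version below gets the spider list from a CrawlerProcess instead: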

    from scrapy.utils.project import get_project_settings
    from scrapy.crawler import CrawlerProcess
    import requests
    
    setting = get_project_settings()
    process = CrawlerProcess(setting)
    
    url = 'http://localhost:6800/schedule.json'
    
    # get_db_connection() is the project's own helper (not shown here);
    # it returns a database cursor
    cursor = get_db_connection()
    cursor.execute("SELECT * FROM lojas WHERE disponivel = 'S'")
    rows = cursor.fetchall()
    
    # Put every site domain into a list.
    # Further down, a check makes sure that only the available spiders
    # whose name matches a site domain are scheduled.
    sites = []
    for row in rows:
        site = row[2]
        print(site)
    
        # add each site to the list
        sites.append(site)
    
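    # spider_loader is the current name of the deprecated `spiders` attribute
    for spider_name in process.spider_loader.list():
        #process.crawl(spider_name,query="dvh") #query dvh is a custom argument used in your spider
        # only schedule spiders whose name matches an available site
        if spider_name in sites:
            print("Scheduling spider %s" % spider_name)
            values = {'project': 'esportifique', 'spider': spider_name}
            r = requests.post(url, data=values)
            print(r.text)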
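
The filtering and scheduling steps above can also be pulled out into two small functions, which makes them easier to test without a running Scrapyd or database. This is only a sketch: the function names `select_spiders` and `schedule_spiders` and the injectable `post` parameter are assumptions made for illustration, not part of Scrapy or Scrapyd. The URL and project name are the ones used in the answer.

```python
SCRAPYD_SCHEDULE_URL = 'http://localhost:6800/schedule.json'  # URL from the answer


def select_spiders(spider_names, active_sites):
    """Return the spider names that match an active site, keeping spider order.

    Hypothetical helper: assumes spider names equal the site domains
    stored in the database, as in the answer.
    """
    active = set(active_sites)
    return [name for name in spider_names if name in active]


def schedule_spiders(spider_names, project, post=None, url=SCRAPYD_SCHEDULE_URL):
    """POST one schedule.json request per spider and return the response bodies.

    `post` is an injectable callable (hypothetical parameter) so the function
    can be exercised without a network; it defaults to requests.post.
    """
    if post is None:
        import requests  # same HTTP library the answer uses
        post = requests.post

    bodies = []
    for name in spider_names:
        response = post(url, data={'project': project, 'spider': name})
        bodies.append(response.text)
    return bodies
```

With Scrapy installed, `select_spiders(process.spider_loader.list(), sites)` would give the names to pass to `schedule_spiders(names, 'esportifique')`.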