
Bottleneck in Scrapy middleware: MySQL SELECT


I've tested where the bottleneck is: it comes from the SELECT query in my downloader middleware.

import pymysql
from scrapy.exceptions import IgnoreRequest


class CheckDuplicatesFromDB(object):

    def process_request(self, request, spider):

        # url_list is just a plain Python list with some URLs in it.
        if request.url not in url_list:

            # A new database connection is opened for every single request.
            connection = pymysql.connect(host='123',
                                         user='123',
                                         password='1234',
                                         db='123',
                                         charset='utf8',
                                         cursorclass=pymysql.cursors.DictCursor)

            try:
                with connection.cursor() as cursor:
                    # Read a single record
                    sql = "SELECT `url` FROM `url` WHERE `url`=%s"
                    cursor.execute(sql, (request.url,))
                    row = cursor.fetchone()
            finally:
                connection.close()

            # Drop the request if its URL is already in the database.
            if row is not None and row['url'] == request.url:
                raise IgnoreRequest()

        return None

If I disable DOWNLOADER_MIDDLEWARES in settings.py, the Scrapy crawl speed is not bad.

With the middleware enabled:

[scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 4 items (at 2 items/min)

With the middleware disabled:

[scrapy.extensions.logstats] INFO: Crawled 55 pages (at 55 pages/min), scraped 0 items (at 0 items/min)

I suspect the SELECT query is the problem, so I want to run the query only once, fetch the URL data, and put it into the request fingerprints.

I am using CrawlerProcess: the more spiders I run, the fewer pages per minute are crawled.

Example:

  • 1 spider => 50 pages/min
  • 2 spiders => 30 pages/min in total
  • 6 spiders => 10 pages/min in total

What I want to do is:

  1. fetch the URL data from MySQL once
  2. put the URL data into the request fingerprints

How can I do this?


Solution

  • One major problem is that you are opening a new connection to the SQL database on every call to process_request, i.e. for every request. Instead, open the connection once and keep it open.

    While this will result in a major speedup, I suspect there are other bottlenecks that will surface once this one is fixed.
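A minimal sketch of that fix, reusing the placeholder credentials from the question: the middleware hooks Scrapy's spider_opened / spider_closed signals so that one pymysql connection is opened at the start of the crawl and reused by every call to process_request.

import pymysql
from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class CheckDuplicatesFromDB(object):

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        # Open the connection once per crawl instead of once per request.
        self.connection = pymysql.connect(host='123',  # placeholder credentials from the question
                                          user='123',
                                          password='1234',
                                          db='123',
                                          charset='utf8',
                                          cursorclass=pymysql.cursors.DictCursor)

    def spider_closed(self, spider):
        self.connection.close()

    def process_request(self, request, spider):
        # Reuse the long-lived connection; only the query itself runs per request.
        with self.connection.cursor() as cursor:
            cursor.execute("SELECT `url` FROM `url` WHERE `url`=%s", (request.url,))
            row = cursor.fetchone()

        if row is not None:
            raise IgnoreRequest()

        return None

Note that a long-running crawl can outlive MySQL's idle-connection timeout, so production code would typically also call connection.ping(reconnect=True) before querying.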
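To go one step further in the direction the question describes (one SELECT for the whole crawl rather than one per request), here is a minimal sketch that loads every known URL into an in-memory set when the spider opens and deduplicates with a plain membership test. Again, the credentials and the url table are the placeholders from the question.

import pymysql
from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class CheckDuplicatesFromDB(object):

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        return middleware

    def spider_opened(self, spider):
        # One SELECT for the whole crawl: pull every crawled URL into a set.
        connection = pymysql.connect(host='123', user='123', password='1234',
                                     db='123', charset='utf8')
        try:
            with connection.cursor() as cursor:
                cursor.execute("SELECT `url` FROM `url`")
                self.crawled_urls = {row[0] for row in cursor.fetchall()}
        finally:
            connection.close()

    def process_request(self, request, spider):
        # A set lookup is O(1) and needs no round trip to the database.
        if request.url in self.crawled_urls:
            raise IgnoreRequest()
        return None

This trades RAM for speed, so it only works while the URL table fits in memory. A more integrated alternative would be to subclass Scrapy's RFPDupeFilter and pre-populate its fingerprints set, which is closer to the "put the URL data into the request fingerprints" idea, but the set-based middleware above is the simplest version of the same technique.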