I've tested where the bottleneck is. It comes from the select query in the middleware.
import pymysql
import pymysql.cursors
from scrapy.exceptions import IgnoreRequest

class CheckDuplicatesFromDB(object):
    def process_request(self, request, spider):
        # url_list is just a plain Python list with some urls in it.
        if request.url not in url_list:
            self.crawled_urls = dict()
            connection = pymysql.connect(host='123',
                                         user='123',
                                         password='1234',
                                         db='123',
                                         charset='utf8',
                                         cursorclass=pymysql.cursors.DictCursor)
            try:
                with connection.cursor() as cursor:
                    # Read a single record
                    sql = "SELECT `url` FROM `url` WHERE `url`=%s"
                    cursor.execute(sql, (request.url,))
                    self.crawled_urls = cursor.fetchone()
                connection.commit()
            finally:
                connection.close()
            if self.crawled_urls is None:
                return None
            elif request.url == self.crawled_urls['url']:
                # The url is already in the database: drop the request.
                raise IgnoreRequest()
            else:
                return None
        else:
            return None
If I disable DOWNLOADER_MIDDLEWARES in settings.py, the scrapy crawl speed is not bad.
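For reference, "disabling" here just means commenting the middleware out of the settings dict; the module path and priority below are placeholders, not my real project layout:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # 'myproject.middlewares.CheckDuplicatesFromDB': 543,
}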
Before disabling:
[scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 4 items (at 2 items/min)
After disabling:
[scrapy.extensions.logstats] INFO: Crawled 55 pages (at 55 pages/min), scraped 0 items (at 0 items/min)
I guess the select query is the problem. So I want to run the select query only once and use the url data to fill the request finger_prints.
I am also using CrawlerProcess: the more spiders I run, the fewer pages/min get crawled.
Example:
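A rough sketch of the multi-spider setup (the spider names here are placeholders, not my real spiders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('spider_one')  # placeholder spider name
process.crawl('spider_two')  # placeholder spider name
process.start()  # blocks here until all spiders are finished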
What I want to do is pre-load the crawled urls from the database into the finger_prints, roughly as sketched below.
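This is only a sketch of the idea, not working code: it assumes a custom dupe filter is the right place to seed the fingerprints, and it reuses the connection details from the middleware above:

import pymysql
from scrapy import Request
from scrapy.dupefilters import RFPDupeFilter
from scrapy.utils.request import request_fingerprint

class DBSeededDupeFilter(RFPDupeFilter):
    def __init__(self, path=None, debug=False):
        super(DBSeededDupeFilter, self).__init__(path, debug)
        # Run the select query exactly once, at startup.
        connection = pymysql.connect(host='123', user='123', password='1234',
                                     db='123', charset='utf8')
        try:
            with connection.cursor() as cursor:
                cursor.execute("SELECT `url` FROM `url`")
                for (url,) in cursor.fetchall():
                    # Seed the fingerprint set so these urls are never requested.
                    self.fingerprints.add(request_fingerprint(Request(url)))
        finally:
            connection.close()

and then enable it with DUPEFILTER_CLASS = 'myproject.dupefilters.DBSeededDupeFilter' in settings.py (the module path is a placeholder).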
How can I do this?
One major problem is that you are opening a new connection to the SQL database on every response, i.e. on every call to process_request. Instead, open the connection once and keep it open.
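A minimal sketch of that change, reusing the connection details from the question (reconnection and error handling omitted):

import pymysql
import pymysql.cursors
from scrapy.exceptions import IgnoreRequest

class CheckDuplicatesFromDB(object):
    def __init__(self):
        # Open the connection once, when the middleware is instantiated,
        # instead of once per request.
        self.connection = pymysql.connect(host='123',
                                          user='123',
                                          password='1234',
                                          db='123',
                                          charset='utf8',
                                          cursorclass=pymysql.cursors.DictCursor)

    def process_request(self, request, spider):
        with self.connection.cursor() as cursor:
            sql = "SELECT `url` FROM `url` WHERE `url`=%s"
            cursor.execute(sql, (request.url,))
            if cursor.fetchone() is not None:
                # The url is already in the database: skip it.
                raise IgnoreRequest()
        return None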
While this will result in a major speedup, I suspect there are other bottlenecks that will show up once this one is fixed.