Tags: mongodb, web-scraping, indexing, pymongo, insert-update

How to optimize an update query in pymongo for a scraping project


How do I create and refresh an index in pymongo to speed up update queries? As mentioned in the MongoDB section of the article[1], the following code works fine for a small set of entries:

    self.collection.update({'url': item['url']}, dict(item), upsert=True)

But once the collection reaches tens of thousands of documents, the update becomes very slow.

[1] https://realpython.com/web-scraping-and-crawling-with-scrapy-and-mongodb/#mongodb
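
For reference, Collection.update has since been removed in pymongo 4.x; assuming item is dict-like with a 'url' key as in the snippet above, the equivalent upsert with the current API uses replace_one:

    # Current-API equivalent of the snippet above: match on 'url' and
    # replace the whole document, inserting it if no match exists.
    self.collection.replace_one({'url': item['url']}, dict(item), upsert=True)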


Solution

  • Create an index on the url field:

    https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.create_index

    https://docs.mongodb.com/manual/indexes/

    self.collection.create_index('url')
    

    Since url will be unique in your case, you can create a unique index instead:

    https://docs.mongodb.com/manual/core/index-unique/#unique-indexes

    self.collection.create_index('url', unique=True)
    

    Note: if you already have a large amount of existing data, create the index in the background (a combined sketch follows below).

    https://docs.mongodb.com/manual/core/index-creation/
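
Putting the pieces together, here is a minimal sketch of a Scrapy-style item pipeline that builds the unique index once and then upserts each scraped item. The class, URI, database, and collection names are illustrative assumptions, not the asker's actual code:

    import pymongo

    class MongoDBPipeline:
        # Illustrative sketch only; connection details and names are assumptions.

        def __init__(self, mongo_uri='mongodb://localhost:27017', db_name='scraping'):
            self.client = pymongo.MongoClient(mongo_uri)
            self.collection = self.client[db_name]['items']
            # Build the unique index up front; creating an identical index
            # again is a no-op. background=True only has an effect on MongoDB
            # servers older than 4.2; newer servers ignore it.
            self.collection.create_index('url', unique=True, background=True)

        def process_item(self, item, spider):
            # With the index in place, the filter on 'url' is an index lookup
            # rather than a full collection scan, so the upsert stays fast
            # even with tens of thousands of documents.
            self.collection.replace_one({'url': item['url']}, dict(item), upsert=True)
            return item

Keep in mind that building a unique index fails if the collection already contains duplicate url values, so remove duplicates first in that case.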