I have a script which scrapes data from a sitemap .xml file and downloads some data from each page. Each time I start the scrape function, I fetch this xml to get the URLs to scrape, put them in a list, compare them to the list of URLs I've already downloaded, and scrape only the remainder. I use these URLs as the _id of a mongodb collection:
list_of_ids = collection.find().distinct('_id')
start_urls = list(set(new_url_list) - set(list_of_ids))
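For context, the step that builds new_url_list looks roughly like this (the sitemap URL here is a placeholder, and I'm assuming a standard sitemap with &lt;loc&gt; entries):

import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = 'https://example.com/sitemap.xml'  # placeholder, not my real site
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Fetch the sitemap and collect every <loc> URL into a list
root = ET.fromstring(requests.get(SITEMAP_URL).content)
new_url_list = [loc.text for loc in root.findall('.//sm:loc', NS)]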
I've now reached the point of receiving the following error:
pymongo.errors.OperationFailure: distinct too big, 16mb cap, full error: {'ok': 0.0, 'errmsg': 'distinct too big, 16mb cap', 'code': 17217, 'codeName': 'Location17217'}
I assume I could just iterate through the collection and append each _id to a list, but even though I'm not hell-bent on performance, surely there's a better way?
_ids have to be distinct anyway, so I'm not sure why you need distinct at all, unless it's just an easy way to get a list without the clutter.
Try this instead. find() returns a cursor that streams documents in batches, so unlike distinct it isn't bound by the 16MB limit on a single result document:
list_of_ids = [x['_id'] for x in collection.find({}, {'_id': 1})]
In fact, since you're computing a set anyway, use the analogous set comprehension:
set_of_ids = {x['_id'] for x in collection.find({}, {'_id': 1})}
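With that, your existing comparison simplifies to:

start_urls = list(set(new_url_list) - set_of_ids)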