I'm trying to extract all the geotagged photos from Flickr using the Flickr API method flickr.photos.search. Here is the code:
import flickr_api
from flickr_api.api import flickr

flickr_api.set_keys(api_key='my_api_key', api_secret='my_api_secret')
flickr_api.set_auth_handler("AuthToken")

for i in range(1, 1700):
    # Request one page of geotagged photos (250 per page) and save the raw XML response
    photo_list = flickr.photos.search(api_key='my_api_key', has_geo=1,
                                      extras='description,license,geo,tags,machine_tags',
                                      per_page=250, page=i,
                                      min_upload_date='972518400', accuracy=12)
    f = open('xmldata1/photodata' + str(i) + '.xml', 'w')
    f.write(photo_list)
    f.close()
The script runs and gives me one XML file per page, each containing data for 250 photos, so 1699 files and roughly 420,000 photo records in total. However, most of them are duplicates: after removing duplicates I am left with only 9022 unique images.
I have read here that it is safe to query at most 16 pages (4000 images) at a time to avoid duplicates.
I want to avoid duplicate images as much as possible, and I need 100,000+ unique geotagged images for GPS clustering.
What time lag should I insert between two queries? If I should take a different approach, please elaborate on it.
Let me know if you have any queries. Any help would be appreciated!
Try using max_upload_date along with min_upload_date. Keep the time frame to a couple of days, and keep shifting that window forward from your original min_upload_date toward the present, searching only for photos uploaded within the current window.
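Below is a rough sketch of that approach, reusing the setup from your question. The two-day window, the 16-page cap per window, the one-second pause, and the output file naming are illustrative assumptions, and the code assumes flickr.photos.search returns the raw XML string exactly as it does in your script.

import time
import flickr_api
from flickr_api.api import flickr

flickr_api.set_keys(api_key='my_api_key', api_secret='my_api_secret')
flickr_api.set_auth_handler("AuthToken")

WINDOW = 2 * 24 * 60 * 60   # two-day upload window (assumption; tune as needed)
MAX_PAGES = 16              # stay within the ~16 pages / 4000 photos per query limit
START = 972518400           # same min_upload_date as in your script
END = int(time.time())

file_index = 1
window_start = START
while window_start < END:
    window_end = min(window_start + WINDOW, END)
    for page in range(1, MAX_PAGES + 1):
        # Only photos uploaded inside the current window are returned, so each
        # query covers far fewer than 4000 photos and duplicates are avoided.
        photo_list = flickr.photos.search(api_key='my_api_key', has_geo=1,
                                          extras='description,license,geo,tags,machine_tags',
                                          per_page=250, page=page,
                                          min_upload_date=str(window_start),
                                          max_upload_date=str(window_end),
                                          accuracy=12)
        f = open('xmldata1/photodata' + str(file_index) + '.xml', 'w')
        f.write(photo_list)
        f.close()
        file_index += 1
        # In practice, parse the pages="..." attribute of the <photos> element in the
        # response and break out early once every page of this window has been fetched.
        time.sleep(1)       # small pause between requests (assumption; keeps you under the rate limit)
    window_start = window_end + 1   # advance past the window just covered so windows never overlap

If a two-day window still returns more than about 4000 photos during busy periods, shrink the window; if it returns very few, widen it so you do not waste requests on empty pages.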