My goal is to extract geodata (lat and lon values), views, photo IDs, URLs and the date posted from the Flickr database for photos inside the geographic boundaries of the city of Cologne, Germany. The data is then written to a csv file. The total number of results using just tags='Köln'
is around 110,000. I want to extract at least a five-digit number of data points from that. To achieve this, I narrow each query with three parameters: the tag, the maximum upload date and the minimum upload date.
What already works: The data is successfully written into the csv.
What does not work yet: When I dump the search results using xml.etree.ElementTree.dump(), I can see that around 3,700 results are found for the respective search parameters. As far as I know, this number is within the limit of 4,000 results per query set by Flickr. However, only between 700 and 1,000 data points are written to the csv file. The number is never the same and varies with every execution, which is weird, because I clearly defined the time frame. Also, despite adding a one-second pause between calls with time.sleep(1), I still get kicked out by the server from time to time (error code 500). After struggling a lot with the barely documented limits, I really can't tell why my code is still not working as intended.
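One idea I had for the intermittent 500s is to wrap each API call in a small retry helper. The following is only a minimal sketch: retry_call is a hypothetical helper of my own, and I am assuming that flickrapi raises flickrapi.exceptions.FlickrError when a call fails.

import time
from flickrapi.exceptions import FlickrError

def retry_call(api_method, retries=3, wait=5, **kwargs):
    ## Hypothetical helper (not part of flickrapi): retry a flaky API call,
    ## pausing a little longer after each failed attempt
    for attempt in range(retries):
        try:
            return api_method(**kwargs)
        except FlickrError:
            if attempt == retries - 1:
                raise  ## give up after the final attempt
            time.sleep(wait * (attempt + 1))

## Usage would be e.g.:
## geodata = retry_call(flickr.photos_geo_getLocation, photo_id=photo.attrib['id'])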
The code I used is as follows:
import flickrapi
import os
import datetime
import time

## Only needed to explore the xml tree
# import xml

## API key and secret provided by Flickr
api_key = 'api key'
api_secret = 'api secret'

## Approximate geographic coordinates of the administrative boundaries of the city of Cologne
boundaries = '6.8064182,50.8300729,7.1528453,51.0837915'

## Counter for the ID column required by GIS software
id_count = 1

## Creation of an appendable csv file; the header row is only written when the file is empty
csv = open('flickr_data.csv', mode='a')
if os.stat('flickr_data.csv').st_size == 0:
    csv.write('ID,Photo_ID,Lat,Lon,Views,Taken_Unix,Taken,URL \n')

## Authentication of the Flickr API
flickr = flickrapi.FlickrAPI(api_key, api_secret)

## Page counter
page_number = 1

## Only needed to explore the xml tree
# test_list = flickr.photos_search(max_upload_date='2020-07-09 23:59:59', min_upload_date='2020-01-15 00:00:00',
#                                  tags='Köln', bbox=boundaries, has_geo='1', page=1, extras='views', per_page='250')
# xml.etree.ElementTree.dump(test_list)

## The while loop keeps running until page 16 is reached. The total number of pages for the wanted search query is 452.
## However, Flickr only returns a number of photos equivalent to 16 pages of 250 results.
## At this point, the code is reiterated until the maximum number of pages is reached.
while page_number < 17:
    ## Flickr search within the geographic boundaries of Cologne, Germany
    photo_list = flickr.photos_search(tags='Köln',
                                      max_upload_date='2020-07-09 23:59:59',
                                      min_upload_date='2020-01-15 00:00:00',
                                      bbox=boundaries,
                                      has_geo='1',
                                      page=page_number,
                                      extras='views',
                                      per_page='250')  ## maximum allowed photos per page for bbox-delimited requests
    ## The for loop keeps running as long as there are photos on the page
    for photo in photo_list[0]:
        ## Extraction of latitude and longitude data from the search results
        geodata = flickr.photos_geo_getLocation(photo_id=photo.attrib['id'])
        lat = geodata[0][0].attrib['latitude']
        lon = geodata[0][0].attrib['longitude']
        ## Extraction of views from the search results
        views = photo.get('views')
        ## Extraction and conversion of the upload date
        photo_info = flickr.photos.getInfo(photo_id=photo.attrib['id'])
        date_unix = int(photo_info[0][4].attrib['posted'])
        date = datetime.datetime.utcfromtimestamp(date_unix).strftime('%Y-%m-%d %H:%M:%S')
        url = 'https://www.flickr.com/photos/' + photo.attrib['owner'] + '/' + photo.attrib['id']
        ## The csv is filled with the acquired information
        csv.write('%s,%s,%s,%s,%s,%s,%s,%s \n' % (id_count,
                                                  photo.attrib['id'],
                                                  lat,
                                                  lon,
                                                  views,
                                                  date_unix,
                                                  date,
                                                  url))
        id_count += 1
        ## 1 second wait time between calls to prevent error code 500
        time.sleep(1)
    ## Turns the page
    page_number += page_number

## Total number of photos written (header row excluded)
print(sum(1 for line in open('flickr_data.csv')) - 1)
csv.close()
The following is an excerpt of the XML returned by flickr.photos_search:
<rsp stat="ok">
<photos page="1" pages="16" perpage="250" total="3755">
<photo id="50094525552" owner="98355876@N00" secret="6d66d421af" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="250" />
<photo id="50093709173" owner="98355876@N00" secret="90c31cac1d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="260" />
<photo id="50093706783" owner="98355876@N00" secret="9521b8ba7d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="224" />
<photo id="50093641658" owner="82692690@N02" secret="e26afb1e79" server="65535" farm="66" title="Cabecera. Catedral gótica de Colonia. JX3." ispublic="1" isfriend="0" isfamily="0" views="201" />
<photo id="50090280721" owner="98355876@N00" secret="cc0e2d7b8b" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="295" />
<photo id="50090278631" owner="98355876@N00" secret="8113aaa628" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="280" />
<photo id="50090277186" owner="98355876@N00" secret="73753c811d" server="65535" farm="66" title="-" ispublic="1" isfriend="0" isfamily="0" views="320" />
<photo id="50090150901" owner="136678496@N04" secret="6de14ca572" server="65535" farm="66" title="Good Morning" ispublic="1" isfriend="0" isfamily="0" views="104" />
<photo id="50089819277" owner="7283893@N05" secret="43e5290b07" server="65535" farm="66" title="Der Chef / The Boss" ispublic="1" isfriend="0" isfamily="0" views="421" />
The following is the output of the script, with the ID count printed in every iteration of the for loop and the page number printed with every iteration of the while loop:
1
2
3
4
5
6
7
8
9
(...)
245
246
247
248
249
250
-------- PAGE 2 --------
251
252
253
254
255
256
257
(...)
493
494
495
496
497
498
499
500
-------- PAGE 4 --------
501
502
503
504
505
506
507
(...)
743
744
745
746
747
748
749
750
-------- PAGE 8 --------
751
752
753
754
755
756
757
758
759
(...)
990
991
992
993
994
995
996
997
998
999
1000
-------- PAGE 16 --------
1001
1002
1003
1004
1005
-------- PAGE 32 --------
As you found, page_number doubles, skipping most of the API results, because of the increment at the end of your while loop:
page_number += page_number
To fix it, simply increment by 1 instead:
page_number += 1
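If you also want to avoid the hard-coded limit of 17, you could let the pages attribute of the response drive the loop. This is only a sketch under that assumption, not a drop-in replacement:

page_number = 1
total_pages = 1  ## placeholder until the first response reports the real count
while page_number <= total_pages:
    photo_list = flickr.photos_search(tags='Köln',
                                      max_upload_date='2020-07-09 23:59:59',
                                      min_upload_date='2020-01-15 00:00:00',
                                      bbox=boundaries,
                                      has_geo='1',
                                      page=page_number,
                                      extras='views',
                                      per_page='250')
    total_pages = int(photo_list[0].attrib['pages'])  ## e.g. 16 for this query
    for photo in photo_list[0]:
        pass  ## process each photo exactly as before
    page_number += 1  ## increment by one so no page is skipped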