
Python While Loop Problem - Instagram API Returns Pagination Objects but not new results


I am trying to extract a list of Instagram posts that have been tagged with a certain hashtag, using an API hosted on RapidAPI (instagram-data1.p.rapidapi.com). Instagram paginates the results, so I have to cycle through the pages to get all of them. I am encountering a strange bug: each request returns what is labeled as the next page, but the posts in it are the same as on the previous page.

To use the analogy of a book: I can see page 1 and ask the book to show me page 2. The book shows me a page labeled page 2, but its contents are the same as page 1.

Using the container provided by the RapidAPI website, I do not encounter this error. This leads me to believe that the problem must be on my end, presumably in the while loop I have written.

If somebody could please review my while loop, or suggest anything else that would correct the problem, I would greatly appreciate it. The 'list index out of range' error at the bottom is easily fixable, so I'm not concerned about it.

Other info: this particular hashtag has 694 results, and the API returns 50 items per page.
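
As a sanity check on the loop, the expected number of pages follows from those two figures. This is a minimal sketch, assuming every page except the last is full; the variable names are illustrative only.

```python
import math

total_results = 694   # 'count' reported by the API for this hashtag
page_size = 50        # items per page, as observed

# Number of requests needed to see every result
expected_pages = math.ceil(total_results / page_size)
# Items left over for the final, partially filled page
last_page_items = total_results - (expected_pages - 1) * page_size

print(expected_pages, last_page_items)
```

If the loop runs for noticeably fewer or more iterations than this, the pagination is probably not advancing as intended.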

import http.client
import json
import time


conn = http.client.HTTPSConnection("instagram-data1.p.rapidapi.com") #endpoint supplied by RAPIDAPI
##Begin Credential Section
headers = {
    'x-rapidapi-key': "*removed*",
    'x-rapidapi-host': "instagram-data1.p.rapidapi.com"
    }
##End Credential Section
hashtag = 'givingtuesdayaus'

conn.request("GET", "/hashtag/feed?hashtag=" + hashtag, headers=headers)

res = conn.getresponse()
data = res.read()
print(data.decode("utf-8")) #Purely for debugging, can be disabled
json_dictionary = json.loads(data.decode("utf-8")) #Saving returned results into JSON format, because I find it easier to work with
i = 1 # Results need to cycle through pages, using 'i' to track the number of loops and for input in the name of the file which is saved
with open(hashtag + str(i) + '.json', 'w') as json_file:
    json.dump(json_dictionary['collector'], json_file)

#JSON_dictionary contains five fields, 'count' which is number of results for hashtag query, 'has_more' boolean indicating if there are additional pages
# 'end_cursor' string which can be added to the url to cycle to the next page, 'collector' list containing post information, and 'len'

#while loop essentially checks if the 'has_more' indicates there are additional pages, if true uses the 'end_cursor' value to cycle to the next page
while json_dictionary['has_more']:
    time.sleep(1)
    cursor = json_dictionary['end_cursor']
    conn.request("GET", "/hashtag/feed?hashtag=" + hashtag +'&end-cursor=' + cursor, headers=headers)
    res = conn.getresponse()
    data = res.read()
    json_dictionary = json.loads(data.decode("utf-8"))
    i += 1
    print(i)
    print(json_dictionary['collector'][1]['id'])
    print(cursor) #these three prints rows are only used for debugging.
    with open(hashtag + str(i) + '.json', 'w') as json_file:
        json.dump(json_dictionary['collector'], json_file)

Results from the Python console (as you can see, the cursor and 'i' advance, but the post id remains the same; the saved JSON files also all contain the same posts):

> {"count":694,"has_more":true,"end_cursor":"QVFCd2pVdEN2d01rNkw3UmRKSGVUN1EyanBlYzBPMS15MkIyUG1VdHhjWlJWMDBwRmVhaEYxd0czSE0wMktFcGhfMnItak5ZOE1GTzJvd05FU0pTMWxmVg==","collector":[{"id":"2467140087692742224","shortcode":"CI9CtaaDU5Q","type":"GraphImage",.....} #shortened by poster
> 2
> 2464906276234990574
> QVFCd2pVdEN2d01rNkw3UmRKSGVUN1EyanBlYzBPMS15MkIyUG1VdHhjWlJWMDBwRmVhaEYxd0czSE0wMktFcGhfMnItak5ZOE1GTzJvd05FU0pTMWxmVg==
> 3
> 2464906276234990574
> QVFDVUlROFVKVVB3SEwyR05MSzJHZ2V1UXZqSzlzTVFhWDNBM3hXNENMcThKWExwWU90RFRnRm1FNWtSRGtrbTdORFIwRlU2QWZaSVByOHZhSXFnQnJsVg==
> 4
> 2464906276234990574
> QVFEVFpheV9SeFZCcWlKYkc3NUZZdG00Rk5KMWJsQVBNakJlZDcyMGlTWm9rUTlIQzRoYjVtTU1uRmhJZG5TTFBSOXdhbHozVUViUjZEbVpLdjVUQlJtVQ==
> Traceback (most recent call last):
>   File "<input>", line 33, in <module>
> IndexError: list index out of range
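
The IndexError most likely comes from indexing `collector[1]` on a page that holds fewer than two posts. Below is a small sketch of a guarded lookup. `second_post_id` is a hypothetical helper, and the page layout is assumed to match the fields described in the code comments above.

```python
def second_post_id(page):
    """Return the id of the second post on a page, or None if the page has fewer than two posts."""
    posts = page.get('collector', [])
    return posts[1]['id'] if len(posts) > 1 else None
```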

Solution

  • Apologies to everyone who has read this far, I am an idiot.

    I identified the error shortly after posting. It is in this line:

    conn.request("GET", "/hashtag/feed?hashtag=" + hashtag +'&end-cursor=' + cursor, headers=headers)
    

    The query parameter 'end-cursor' should be 'end_cursor' (underscore, not hyphen).
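
    For reference, here is a sketch of the loop with the corrected parameter name. It builds the query string with `urllib.parse.urlencode` so the parameter names sit in one place, and it takes the page-fetching function as an argument so the loop can be exercised without network access. `build_feed_path`, `iter_pages`, and `fetch_page` are names made up for this sketch, and the response fields are assumed to be the ones described in the question.

```python
from urllib.parse import urlencode


def build_feed_path(hashtag, cursor=None):
    """Build the request path; the cursor parameter is 'end_cursor' (underscore)."""
    params = {'hashtag': hashtag}
    if cursor:
        params['end_cursor'] = cursor
    return '/hashtag/feed?' + urlencode(params)


def iter_pages(fetch_page, hashtag):
    """Yield each page's 'collector' list, following 'end_cursor' while 'has_more' is true.

    fetch_page is any callable that takes a request path and returns the decoded JSON dict.
    """
    page = fetch_page(build_feed_path(hashtag))
    yield page['collector']
    while page['has_more']:
        page = fetch_page(build_feed_path(hashtag, page['end_cursor']))
        yield page['collector']
```

    With `http.client`, `fetch_page` would wrap the `conn.request` / `conn.getresponse` / `json.loads` steps from the question. Note that `urlencode` percent-encodes the trailing `=` characters of the cursor, whereas the original code sent them unescaped.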