I am having some trouble understanding what is happening with my code:
import json
from simple_salesforce import Salesforce, SalesforceLogin
fileCount = 1
saveFilesPath ='<path>/'
fileName = saveFilesPath+'test_file'+str(fileCount)+'.json'
sf = Salesforce(username='<username>', password='<password>', security_token='<token>', domain='test' )
initialQuery = sf.query("SELECT id, name, createddate, lastmodifieddate FROM surveyquestionresponse__c")
nextChunk = initialQuery['nextRecordsUrl']
nextQuery = sf.query_more(nextChunk, True)
print(nextChunk)
print(nextQuery['nextRecordsUrl'])
#with open(fileName, 'w') as outfile :
#    json.dump(initialQuery['records'], outfile)

#while nextQuery['nextRecordsUrl'] is not None :
#    fileCount += 1
#    fileName = saveFilesPath+'test_file'+str(fileCount)+'.json'
#    print(nextQuery['nextRecordsUrl'])
#    with open(fileName, 'w') as outfile :
#        json.dump(nextQuery['records'], outfile)
Two things are happening with this. First, the initial query gives /services/data/v38.0/query/01gf000000gFYRwAAO-2000 as the next records URL, but nextQuery then gives /services/data/v38.0/query/01gf000000gFYRwAAO-4000, which looks strange, as if the chunk size is changing.
The other thing is that the chunking never ends. The object has about 95K rows in it, so in theory it should spit out about 25 files @ 4000 or 48 files @ 2000. I am unable to use query_all because of the memory limits in AWS Lambda and the size of some of my objects, so I have to write the files in pieces. How can I get this piece of code to function properly?
The change from ...AAO-2000 to ...AAO-4000 that you noted is expected: each nextRecordsUrl identifies the next 2000-record batch by its starting offset, not by a changing chunk size. Your initial response already contains records 1-2000 plus the ...AAO-2000 URL, which returns records 2001-4000; that response in turn carries the ...AAO-4000 URL for records 4001-6000, and so on until the last batch.
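As a quick way to see this, a sketch like the one below (purely illustrative, reusing the sf connection and object name from your question) prints the total row count and the first two batch URLs; the suffix simply advances by one batch size per call:

# Illustrative sketch only: reuses the sf connection and object from the question.
result = sf.query("SELECT id FROM surveyquestionresponse__c")
print(result['totalSize'])           # total rows matched by the query
print(result['nextRecordsUrl'])      # ends in -2000: the batch after records 1-2000
nextResult = sf.query_more(result['nextRecordsUrl'], True)
print(nextResult['nextRecordsUrl'])  # ends in -4000: the batch after records 2001-4000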
I used the following code to iterate through a series of queries in my own org and capture all the data in the query (~62500 records in total). I didn't encounter a never-ending chunking issue.
# Initiate list for returned data
pulls = []

# Pull initial query (assumes an authenticated sf connection, as in your code)
initialQuery = sf.query("SELECT id, createddate, lastmodifieddate FROM Task")

# Append initial query data to pulls
pulls.append({'len': len(initialQuery['records']), 'url': initialQuery['nextRecordsUrl']})

# Assign nextChunk with 'nextRecordsUrl' value and re-query with new parameters
nextChunk = initialQuery['nextRecordsUrl']
nextQuery = sf.query_more(nextChunk, True)

# Append nextQuery data to pulls
pulls.append({'len': len(nextQuery['records']), 'url': nextQuery['nextRecordsUrl']})

# Set up while loop to re-query Salesforce until the returned
# query does not have a 'nextRecordsUrl' value
x = True
while x == True:
    try:
        # Query the new 'nextRecordsUrl'
        nextQuery = sf.query_more(nextQuery['nextRecordsUrl'], True)
        # Append new query data to pulls
        pulls.append({'len': len(nextQuery['records']), 'url': nextQuery['nextRecordsUrl']})
    except KeyError:  # Triggers when nextQuery has no 'nextRecordsUrl' (last batch)
        # Append final data to pulls
        pulls.append({'len': len(nextQuery['records']), 'url': None})
        # Set x to False to end the loop
        x = False

# Return pulls to view the data
pulls
This is proof-of-concept code that should apply to your situation with some modification. I would suggest changing the pulls.append({'len':len(nextQuery['records']),'url':None}) lines to append whatever data you actually need from each query, or simply the entire JSON object. You can then combine the various JSON objects within the Python script and export them into a single JSON file; a sketch of a per-file variation that fits your Lambda memory constraint follows below. Let me know if you'd like additional support adapting the code to your situation.
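Since the Lambda memory limit is the real constraint, a variant of the same loop can write each batch straight to its own file instead of accumulating everything in pulls, so only one batch is ever held in memory. The sketch below reuses the query, sf connection, saveFilesPath, and file-name pattern from your question, and simply follows nextRecordsUrl until the response stops including one:

import json

# Sketch: one file per batch, assuming the sf connection and saveFilesPath
# from the question are already defined.
fileCount = 1
result = sf.query("SELECT id, name, createddate, lastmodifieddate FROM surveyquestionresponse__c")

while True:
    fileName = saveFilesPath + 'test_file' + str(fileCount) + '.json'
    with open(fileName, 'w') as outfile:
        json.dump(result['records'], outfile)
    # The last batch has no 'nextRecordsUrl', so stop there.
    if 'nextRecordsUrl' not in result:
        break
    # Otherwise fetch the next batch and write it on the next pass.
    result = sf.query_more(result['nextRecordsUrl'], True)
    fileCount += 1

With roughly 95K rows at 2000 records per batch, that works out to about the 48 files you estimated.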