Tags: python, json, list, while-loop

Unable to get multiple JSON responses from URL


I am passing an ID value to an API URL to get a JSON response, but I am getting only one response and all the rest throw 500 errors. I collect the IDs in a list and pass each ID to the API URL as a parameter in a while loop to extract the data.

### Get ID in a variable ###

import requests
import pandas as pd

# url (the base API URL) and headers (request headers) are defined earlier
df_filter = spark.sql("""select distinct ID from filter_view""")

rdd = df_filter.rdd
listOfRows = rdd.collect()
counter = 0
##total_results = []
while counter < len(listOfRows):
    url += '?ids=' + listOfRows[counter].ID 
    response = requests.get(url,headers=headers)
       
    if response.status_code == 200:
        json_response = response.json()
        ##total_results.append(json_response)
        df2 = pd.json_normalize(json_response, record_path=['entities'])
        display(df2)
        
    else:
        print("Error: HTTP status code " + str(response.status_code))
    counter +=1

I am getting output for only one ID, and all the rest end with 500 errors.

Desired output:

ID   ItemID   Details
1    100      text
1    101      text
2    200      text
2    300      text
3    400      sometext
3    500      sometext
   

Output I am getting:

ID   ItemID   Details
1    100     text    
1    101     text
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500
Error: HTTP status code 500

Solution

  • The first iteration produces a valid URL, baseURL/?ids=1, but because the URL is built with the augmented assignment url += ..., every iteration appends to the previous URL: the second iteration produces baseURL/?ids=1?ids=2 when you want baseURL/?ids=2, and the server rejects the malformed query with a 500.
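
    A quick illustration of how the URL grows across iterations (the base URL here is hypothetical):

    url = 'https://api.example.com/items'  # hypothetical base URL
    url += '?ids=' + '1'  # 1st iteration: .../items?ids=1        -> 200
    url += '?ids=' + '2'  # 2nd iteration: .../items?ids=1?ids=2  -> 500
    url += '?ids=' + '3'  # 3rd iteration: the URL keeps growing, still 500

    Build the URL fresh on each iteration instead: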

    while counter < len(listOfRows):
        response = requests.get(f'{url}?ids={listOfRows[counter].ID}', headers=headers)
    

    Does the API support GETting multiple resources in a single request? Typically, with a plural query parameter like ids, it will take either a comma-separated list of resource IDs (?ids=1,2,3) or an array (?ids[]=1&ids[]=2&ids[]=3, or ?ids=1&ids=2&ids=3). If so, it will be way more efficient, and more polite to the API provider, to make one such request.

    response = requests.get(
        url + '?ids=' + ','.join([row.ID for row in listOfRows]),
        headers=headers
    )
    

    You'll probably need to change the code to parse the new response.
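
    As an aside, requests can build the query string for you via its params argument, which also handles URL-encoding. A minimal sketch of both forms, assuming the API accepts them:

    # Comma-separated form: ?ids=1,2,3
    response = requests.get(
        url,
        params={'ids': ','.join(row.ID for row in listOfRows)},
        headers=headers,
    )

    # Repeated-key form: ?ids=1&ids=2&ids=3 (requests expands a list value)
    response = requests.get(
        url,
        params={'ids': [row.ID for row in listOfRows]},
        headers=headers,
    )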

    If multiple GET isn't supported, at least convert this to a for-loop. There's no need to keep track of counter and test counter < len(listOfRows), and it will improve readability.

    df_filter = spark.sql("""select distinct ID from filter_view""")
    
    rdd = df_filter.rdd
    listOfRows = rdd.collect()
    for row in listOfRows:
        response = requests.get(f'{url}?ids={row.ID}', headers=headers)
           
        if response.status_code == 200:
            json_response = response.json()
            df2 = pd.json_normalize(json_response, record_path=['entities'])
            display(df2)
            
        else:
            print("Error: HTTP status code " + str(response.status_code))
    

    Update, based on this comment:

    I have over 5000 IDs that need to be passed one by one. How can they be passed in chunks of 20 each, maybe?

    Build URLs of the form ...?ids=1&ids=2&ids=3... with no more than 20 IDs per URL.

    from itertools import islice
    from typing import Iterable

    def chunker(it: Iterable, chunksize: int):
        # Yield successive lists of at most chunksize items
        # (requires Python 3.8+ for the := operator)
        iterator = iter(it)
        while chunk := list(islice(iterator, chunksize)):
            yield chunk
    
    for id_chunk in chunker([row.ID for row in listOfRows], 20):
        response = requests.get(
            f'{url}?ids=' + '&ids='.join(id_chunk),
            headers=headers
        )
    

    chunker() splits an iterable into lists of length <= chunksize. First extract just the IDs from listOfRows, then chunk them into lists of at most 20, and for each chunk build the URL and make the request. Thank you kafran for chunker().
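
    For example, chunking five IDs with a chunksize of 2:

    >>> list(chunker(['1', '2', '3', '4', '5'], 2))
    [['1', '2'], ['3', '4'], ['5']]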