Search code examples
pythonjsonsocratasoda

Using $limit and $offset to get more than 1,000 rows on a SODA API


I am pulling data with a SODA API using the following code in Python

response = requests.get('https://healthdata.gov/resource/uqq2-txqb.json')

The dataset contains 434,865 rows but when I use the API, it only returns the first 1,000 rows. I saw on another question that $limit can be used to get the first 50,000 rows but how would I combine this with $offset to get all 434,865 rows?

** I figured out how to use $offset and now have the resulting code, is there any way to condense this?

response1 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000')
response2 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=50001')
response3 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=100002')
response4 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=150003')
response5 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=200004')
response6 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=250005')
response7 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=300006')
response8 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=350007')
response9 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000&$offset=400008')

Solution

  • This is referred to as paging, and you can find documentation for example here: https://dev.socrata.com/docs/paging.html

    In there, it is also specified that there are two versions of API:

    • v2.0, where $limit can be max 50,000
    • v2.1, where $limit is unlimited

    The endpoint you are using seems to be support v2.1, at least based on this https://dev.socrata.com/foundry/healthdata.gov/uqq2-txqb so you should be able to use a large value for $limit and retrieve the entire set at once.

    Going the paging route, the $offset value is 0-based, hence your queries should be properly rewritten as:

    response1 = requests.get('https://healthdata.gov/resource/uqq2-txqb.json?$limit=50000')
    response2 = requests.get('https://healthdata.gov/resource/uqq2xqb.json?$limit=50000&$offset=50000')
    response3 = requests.get('https://healthdata.gov/resource/uqq2xqb.json?$limit=50000&$offset=100000')
    response4 = requests.get('https://healthdata.gov/resource/uqq2xqb.json?$limit=50000&$offset=150000')
    

    Note the alignment on multiples of $limit.