Tags: python, loops, getharvest

Python - how to extract data by looping through paginated API (Harvest)


First of all, I have been working with Python for only a couple of days, so I don't necessarily know the best practices or all the terminology ... yet. I learn best by reverse engineering, and my code below is based on the official documentation from Harvest and other bits I've found with google-fu.

My goal is to download all the time entry records from Harvest and save them as a JSON (or ideally a CSV) file.

Official Python example from Harvest's GitHub

This is my adapted code (including all the outputs, which won't be necessary in the final code but are handy for my learning):

import json, urllib.request

#Set variables for authorisation
AUTH = "REDACTED"
ACCOUNT = "REDACTED"

URL = "https://api.harvestapp.com/v2/time_entries"
HEADERS = { "Authorization": AUTH,
            "Harvest-Account-ID": ACCOUNT}
PAGENO = "5"  # page number to request, as a string

request = urllib.request.Request(url=URL+"?page="+PAGENO, headers=HEADERS)
response = urllib.request.urlopen(request, timeout=5)
responseBody = response.read().decode("utf-8")
jsonResponse = json.loads(responseBody)

# Find the values for pagination
links_first = jsonResponse["links"]["first"]
links_last = jsonResponse["links"]["last"]
links_next = jsonResponse["links"]["next"]
links_previous = jsonResponse["links"]["previous"]
nextpage = jsonResponse["next_page"]
page = jsonResponse["page"]
perpage = jsonResponse["per_page"]
prevpage = jsonResponse["previous_page"]
totalentries = jsonResponse["total_entries"]
totalpages = jsonResponse["total_pages"]

#Print the output
print(json.dumps(jsonResponse, sort_keys=True, indent=4))
print("first link : " + links_first)
print("last link : " + links_last)
print("next page : " + str(nextpage))
print("page : " + str(page))
print("per page : " + str(perpage))
print("total records : " + str(totalentries))
print("total pages : " + str(totalpages))

The output response is:

"Squeezed text (5816 lines)"
first link : https://api.harvestapp.com/v2/time_entries?page=1&per_page=100
last link : https://api.harvestapp.com/v2/time_entries?page=379&per_page=100
next page : 6
page : 5
per page : 100
total records : 37874
total pages : 379

Please can someone advise the best way to loop through the pages to form one JSON file? If you are also able to advise the best way to then output that JSON file, I would be very grateful.


Solution

  • I have been using the following code to retrieve all time entries. It could perhaps be a bit more efficient, but it works. The function get_all_time_entries loops through all the pages, collects the time_entries from each page's response into the all_time_entries list, and finally returns that list as prettified JSON.

    import requests
    import json

    def get_all_time_entries():

        url_address = "https://api.harvestapp.com/v2/time_entries"
        headers = {
            "Authorization": "Bearer " + "xxxxxxxxxx",
            "Harvest-Account-ID": "xxxxxx"
        }

        # find out the total number of pages
        r = requests.get(url=url_address, headers=headers).json()
        total_pages = int(r["total_pages"])

        # results will be collected in this list
        all_time_entries = []

        # loop through every page; the stop value must be total_pages + 1,
        # otherwise range() would skip the last page
        for page in range(1, total_pages + 1):
            url = url_address + "?page=" + str(page)
            response = requests.get(url=url, headers=headers).json()
            # keep only the entries themselves, not the pagination metadata
            all_time_entries.extend(response["time_entries"])

        # prettify JSON
        data = json.dumps(all_time_entries, sort_keys=True, indent=4)

        return data

    print(get_all_time_entries())
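
    Note that each page's response also carries the pagination metadata you printed (the links object and next_page). As an alternative to counting pages, you can follow the "next" link until it comes back as null, which also avoids fetching the first page twice just to read total_pages. A minimal sketch under the same headers (the function name here is just for illustration):

    import requests

    def get_all_time_entries_by_links():

        # start from the first page; Harvest returns 100 entries per page here
        url = "https://api.harvestapp.com/v2/time_entries"
        headers = {
            "Authorization": "Bearer " + "xxxxxxxxxx",
            "Harvest-Account-ID": "xxxxxx"
        }

        all_time_entries = []
        while url is not None:
            response = requests.get(url=url, headers=headers).json()
            all_time_entries.extend(response["time_entries"])
            # links/next is null (None once parsed) on the last page
            url = response["links"]["next"]

        return all_time_entries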
    

    You can easily redirect the output of the script to a file with ">" when running it in PowerShell, etc.

    For example:

    python.exe example.py > C:\temp\all_time_entries.json
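
    If you would rather write the files from Python itself (and also get the CSV you mentioned), something along these lines should work. It assumes the get_all_time_entries() function from above, and the CSV column names are only illustrative top-level time-entry fields; swap in whichever fields you actually need:

    import csv, json

    # get_all_time_entries() returns a prettified JSON string, so parse it back
    entries = json.loads(get_all_time_entries())

    # write every entry to one JSON file
    with open("all_time_entries.json", "w", encoding="utf-8") as f:
        json.dump(entries, f, sort_keys=True, indent=4)

    # write selected top-level fields to a CSV file;
    # extrasaction="ignore" skips any keys not listed in fieldnames
    fields = ["id", "spent_date", "hours", "notes"]
    with open("all_time_entries.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(entries)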

    Hope this helps!