python pandas csv python-requests urllib

Downloading a CSV file from a dynamic webpage in Python

A CSV file is periodically uploaded to a known, constant URL (url_variable). I want to automatically download the latest iteration of that CSV file in the course of a Python script.

I have tried using pandas, specifically pd.read_csv(url_variable), but I receive "HTTP Error 403: Forbidden".

Next I tried using urllib and passing in spoofed headers (headers_variable), specifically urllib.request.Request(url_variable, headers=headers_variable). This technique works. However, when a new CSV file is uploaded to the URL and the script is run again, the old CSV file is returned.

How can I alter my code to download the new CSV file each time this block is called?

Solution

Check whether the URL stays the same for new CSV uploads. If it does, simply downloading it again should return the new file.

Here's an example of downloading a CSV file in memory and reading it directly using requests and pandas:

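from io import StringIO

import pandas as pd
import requests

if __name__ == "__main__":
    url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
    # Placeholder header; replace with whatever headers the server expects.
    headers = {"Authorization": "Test"}
    response = requests.get(url, headers=headers)
    # Raise an exception on 4xx/5xx responses (e.g. 403 Forbidden).
    response.raise_for_status()
    # Parse the downloaded text in memory, without writing it to disk.
    df = pd.read_csv(StringIO(response.text))
    print(df.shape)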
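
If the URL is unchanged and an old file still comes back, one possible cause (an assumption, not something confirmed in the question) is a cache sitting between the script and the server, such as a CDN or proxy. The sketch below uses only the standard library, since urllib already worked in the question. It sends Cache-Control and Pragma headers asking caches to revalidate, and appends a throwaway timestamp query parameter so each request has a distinct URL. The names build_fresh_request, download_csv_text and the parameter name "_ts" are hypothetical, and the UTF-8 decoding is assumed.

```python
import time
import urllib.request
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_fresh_request(url, headers=None):
    """Build a urllib Request that asks caches not to serve a stored copy."""
    # Append a throwaway timestamp parameter (hypothetical name "_ts") so the
    # URL differs on every call; existing query parameters are preserved.
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_ts", str(time.time_ns())))
    fresh_url = urlunsplit(parts._replace(query=urlencode(query)))

    # Keep the caller's headers (e.g. a User-Agent) and add the cache hints.
    merged = dict(headers or {})
    merged["Cache-Control"] = "no-cache"
    merged["Pragma"] = "no-cache"
    return urllib.request.Request(fresh_url, headers=merged)


def download_csv_text(url, headers=None, timeout=30):
    """Download the CSV and return it as text (assumes UTF-8 encoding)."""
    request = build_fresh_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The returned text can be passed to pd.read_csv(StringIO(text)) in the same way as response.text above. Some servers reject or ignore unknown query parameters, so the "_ts" trick may need to be dropped; the two headers alone are often enough when the cache honors them.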

Of course, adjust the headers as needed. If the file is large, you could write it to a temporary file and process it from there instead of holding it in memory; see the Python tempfile documentation, "Generate temporary files and directories".
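
For the large-file case, a minimal sketch of the temporary-file approach might look like the following. It streams the response in chunks into a file created with tempfile.NamedTemporaryFile, lets pandas read from that path, and then deletes the file. The function names, the 64 KiB chunk size and the 30-second timeout are illustrative assumptions.

```python
import os
import tempfile


def spool_chunks_to_tempfile(chunks, suffix=".csv"):
    """Write an iterable of byte chunks to a temporary file; return its path."""
    # delete=False keeps the file on disk after the with-block closes it,
    # so the caller is responsible for removing it.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False) as tmp:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                tmp.write(chunk)
        return tmp.name


def read_csv_via_tempfile(url, headers=None, timeout=30):
    """Stream a CSV to a temporary file and load it with pandas."""
    # Imported here so the helper above stays usable without these packages.
    import pandas as pd
    import requests

    # stream=True avoids loading the whole body into memory at once.
    with requests.get(url, headers=headers, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        path = spool_chunks_to_tempfile(response.iter_content(chunk_size=64 * 1024))
    try:
        return pd.read_csv(path)
    finally:
        os.remove(path)  # clean up the temporary file
```

Calling read_csv_via_tempfile(url, headers=headers) would then replace the requests.get and pd.read_csv lines in the example above. Note that pd.read_csv still builds the full DataFrame in memory; only the raw download is kept on disk.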