Tags: python, pandas, csv, python-requests, urllib

Downloading a CSV file from a dynamic webpage in Python


A CSV file is periodically uploaded to a known, constant URL (url_variable). I want to automatically download the latest iteration of that CSV file in the course of a Python script.

I have tried using pandas, specifically pd.read_csv(url_variable), but I receive "HTTP Error 403: Forbidden."

Next, I tried using urllib and passing in spoofed headers (headers_variable), specifically urllib.request.Request(url_variable, headers=headers_variable). This technique works. However, when a new CSV file is uploaded to the URL and the script is run again, the old CSV file is returned.

How can I alter my code to download the new CSV file each time this block is called?


Solution

  • Check whether the URL stays the same when a new CSV is uploaded. If it does, simply requesting it again should return the latest file.

    Here's an example of downloading a CSV file in memory and reading it directly using requests and pandas:

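    from io import StringIO

    import pandas as pd
    import requests

    if __name__ == "__main__":
        url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
        # Placeholder header; replace with whatever the target server requires.
        headers = {"Authorization": "Test"}
        response = requests.get(url, headers=headers)
        # Raise on 403 and other HTTP errors instead of parsing an error page.
        response.raise_for_status()
        # Parse the CSV text in memory, without writing it to disk.
        df = pd.read_csv(StringIO(response.text))
        print(df.shape)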

    Adjust the headers as needed. If the file is large, you could download it to a temporary file and process it from there; see the Python documentation for the tempfile module, "Generate temporary files and directories".
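
For the large-file case mentioned above, here is a minimal sketch of streaming the download into a temporary file using only the standard library. The function name download_to_tempfile and the NO_CACHE_HEADERS constant are hypothetical names chosen for illustration. The Cache-Control and Pragma request headers are an assumption: they ask caches along the way to revalidate, which may help if the stale file in the question comes from an HTTP cache, but whether a given server or CDN honors them is not guaranteed.

```python
import shutil
import tempfile
import urllib.request

# Assumption: the stale CSV is served by an HTTP cache. These request headers
# ask caches to revalidate with the origin server; they may have no effect.
NO_CACHE_HEADERS = {"Cache-Control": "no-cache", "Pragma": "no-cache"}


def download_to_tempfile(url, headers=None, timeout=30):
    """Stream `url` into a named temporary file and return the file's path.

    The caller is responsible for deleting the file when finished with it.
    """
    # Caller-supplied headers (e.g. a User-Agent) override the defaults.
    merged = {**NO_CACHE_HEADERS, **(headers or {})}
    request = urllib.request.Request(url, headers=merged)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # delete=False keeps the file on disk after the handle is closed.
        with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
            # copyfileobj copies in chunks, so the whole body is never
            # held in memory at once.
            shutil.copyfileobj(response, tmp)
            return tmp.name
```

With the names from the question, this could be called as path = download_to_tempfile(url_variable, headers=headers_variable); the returned path can then be passed to pd.read_csv(path) and removed with os.remove(path) once the data is loaded.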