Search code examples
pythonpandascsvpython-requestsurllib

Getting encoded csv from url into Pandas


I'm struggling to get the following csv from this site into Pandas.

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv

I've tried a few things, but so far I can't make a workable csv. The ultimate aim is to be able to make it into a Pandas dataframe.

Can anyone help point me in the right direction and explain why the below doesn't work?

Using Python 3.7, Windows 10

import requests  
import urllib
import csv

csv_url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv'

response = urllib.request.urlopen(csv_url)
cr = csv.reader(response)
for row in cr:
    print(row)
# csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

response = urllib.request.urlopen(csv_url)
response = response.read().decode()
cr = csv.reader(response)
for row in cr:
    print(row)
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 28452: invalid start byte

response = requests.get(csv_url).text
cr = csv.reader(response)
for row in cr:
    print(row)
# malformed, prints individual characters

Solution

  • If you're using pandas >= 0.19.2, you can input the csv url directly.:

    import pandas as pd
    url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
    c=pd.read_csv(url, encoding ='latin1') # otherwise you get a UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 12: invalid start byte  
    

    Demo1


    Otherwise use String.IO, i.e.:

    import pandas as pd
    import requests
    from io import StringIO
    url="https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/773656/HMRC_spending_over_25000_for_December_2018.csv"
    s=requests.get(url).content
    c=pd.read_csv(StringIO(s.decode("latin1")))
    

    Demo2