
Writing large file from .zip archive to Pandas dataframe


This is a link to download a zip file containing about 1GB of postcode-level data published by the UK government's Office for National Statistics: https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data

Information on the data can be found here: http://geoportal.statistics.gov.uk/datasets/19fac93960554b5e90840505bd73917f

I have used this data in a data science application in Python, loading it into a Pandas dataframe. I have integrated this into a simple web page and am deploying it to the cloud. I do not want to include the large data file in my repository, which I access from an AWS EC2 instance. As I understand it, I have two options:

1) Include the zipped file in the repository and read the CSV into a Pandas dataframe.

2) Open the url, stream in the file and extract it in the script and then read the CSV into a Pandas dataframe.

The issue with both of these approaches is that the zip archive contains files other than the CSV I need, and I'm not sure how to access that specific file.
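One way to deal with this is to inspect the archive's member names first. A minimal sketch using a small in-memory stand-in archive (the member names here are invented for illustration, not the real archive's contents):

```python
import io
import zipfile

# Build a tiny in-memory archive standing in for the downloaded zip;
# the member names are invented for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('Data/NSPL_demo.csv', 'pcd,imd\nAB1 0AA,6808\n')
    zf.writestr('User Guide/readme.txt', 'docs')

archive = zipfile.ZipFile(buf)

# namelist() returns every member path, so you can locate the CSV
# and pass exactly that name to archive.read() or archive.open().
csv_members = [n for n in archive.namelist() if n.endswith('.csv')]
print(csv_members)  # ['Data/NSPL_demo.csv']
```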

Another approach I considered was compressing just the individual CSV I need before including it in the repository, but this seems to generate superfluous files:

('Multiple files found in compressed zip file %s', "['NSPCL_AUG19_UK_LU.csv', '__MACOSX/', '__MACOSX/._NSPCL_AUG19_UK_LU.csv']") 

so I have the same issue with not being able to point directly to the file I need.
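For what it's worth, the superfluous `__MACOSX` entries come from compressing with the macOS Finder; zipping with Python's `zipfile` module avoids them, and a single-member archive can then be passed to `pandas.read_csv` directly. A sketch with a hypothetical sample CSV (file names are made up):

```python
import os
import tempfile
import zipfile
import pandas as pd

# Hypothetical sample CSV standing in for the real data file.
tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, 'postcodes.csv')
with open(csv_path, 'w') as f:
    f.write('pcd,imd\nAB1 0AA,6808\n')

# Zipping with Python's zipfile (instead of the macOS Finder)
# produces an archive with no __MACOSX/ metadata entries.
zip_path = os.path.join(tmp, 'postcodes.zip')
with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.write(csv_path, arcname='postcodes.csv')

# Because the archive now holds exactly one file, pandas can read
# it directly without raising the "Multiple files found" error.
df = pd.read_csv(zip_path)
print(df.shape)  # (1, 2)
```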

Please let me know what best practice is and how to get the file I need into a Pandas dataframe.


Solution

  • You could do the following to extract the main csv file and load it directly into a data frame:

    from urllib.request import urlopen
    import zipfile, io
    import pandas as pd
    
    url = 'https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data'
    
    archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read())) # Takes some time
    
    csv_path = 'Data/NSPL_AUG_2019_UK.csv' # The desired csv file in the archive
    
    df = pd.read_csv(io.BytesIO(archive.read(csv_path))) # Takes some time
    
    >>> df
                 pcd      pcd2     pcds  ...   imd     calncv        stp
    0        AB1 0AA  AB1  0AA  AB1 0AA  ...  6808  S99999999  S99999999
    1        AB1 0AB  AB1  0AB  AB1 0AB  ...  6808  S99999999  S99999999
    2        AB1 0AD  AB1  0AD  AB1 0AD  ...  6808  S99999999  S99999999
    3        AB1 0AE  AB1  0AE  AB1 0AE  ...  5503  S99999999  S99999999
    4        AB1 0AF  AB1  0AF  AB1 0AF  ...  6668  S99999999  S99999999
    ...          ...       ...      ...  ...   ...        ...        ...
    2632799  ZE3 9JW  ZE3  9JW  ZE3 9JW  ...  4187  S99999999  S99999999
    2632800  ZE3 9JX  ZE3  9JX  ZE3 9JX  ...  4187  S99999999  S99999999
    2632801  ZE3 9JY  ZE3  9JY  ZE3 9JY  ...  4187  S99999999  S99999999
    2632802  ZE3 9JZ  ZE3  9JZ  ZE3 9JZ  ...  4187  S99999999  S99999999
    2632803  ZE3 9XP  ZE3  9XP  ZE3 9XP  ...  4187  S99999999  S99999999
    
    [2632804 rows x 41 columns]
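    As a memory-saving variant (a sketch of my own, not part of the original answer): `ZipFile.open` returns a file-like object that `read_csv` can consume directly, avoiding a second in-memory copy of the CSV bytes. Demonstrated here on a small stand-in archive with hypothetical sample data:

```python
import io
import zipfile
import pandas as pd

# Small in-memory archive standing in for the downloaded zip
# (hypothetical sample data, not the real NSPL file).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('Data/sample.csv', 'pcd,imd\nAB1 0AA,6808\nAB1 0AB,6808\n')

archive = zipfile.ZipFile(buf)

# archive.open() yields a readable file object, so pandas streams the
# member instead of materialising a second copy via archive.read().
with archive.open('Data/sample.csv') as member:
    df = pd.read_csv(member)

print(df.shape)  # (2, 2)
```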
    

    Now you could store the data frame on its own in a compressed file, as suggested by iliar:

    df.to_pickle('NSPL_AUG_2019_UK.pkl', compression='gzip') # Takes some more time
    

    When I tried this the resulting file size was around 69MB, so you shouldn't have any trouble hosting it in the cloud.

    To access the data frame again, simply load the pickled file:

    >>> df = pd.read_pickle('NSPL_AUG_2019_UK.pkl', compression='gzip')
    >>> df
                 pcd      pcd2     pcds  ...   imd     calncv        stp
    0        AB1 0AA  AB1  0AA  AB1 0AA  ...  6808  S99999999  S99999999
    1        AB1 0AB  AB1  0AB  AB1 0AB  ...  6808  S99999999  S99999999
    2        AB1 0AD  AB1  0AD  AB1 0AD  ...  6808  S99999999  S99999999
    3        AB1 0AE  AB1  0AE  AB1 0AE  ...  5503  S99999999  S99999999
    4        AB1 0AF  AB1  0AF  AB1 0AF  ...  6668  S99999999  S99999999
    ...          ...       ...      ...  ...   ...        ...        ...
    2632799  ZE3 9JW  ZE3  9JW  ZE3 9JW  ...  4187  S99999999  S99999999
    2632800  ZE3 9JX  ZE3  9JX  ZE3 9JX  ...  4187  S99999999  S99999999
    2632801  ZE3 9JY  ZE3  9JY  ZE3 9JY  ...  4187  S99999999  S99999999
    2632802  ZE3 9JZ  ZE3  9JZ  ZE3 9JZ  ...  4187  S99999999  S99999999
    2632803  ZE3 9XP  ZE3  9XP  ZE3 9XP  ...  4187  S99999999  S99999999
    
    [2632804 rows x 41 columns]
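    One detail worth noting (the file name below is my own choice, not the original answer's): if the pickle is written with a `.gz` suffix, pandas infers gzip compression on both write and read, so no explicit `compression` argument is needed in either direction. A small round-trip sketch with hypothetical data:

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample frame standing in for the full NSPL data.
df = pd.DataFrame({'pcd': ['AB1 0AA', 'AB1 0AB'], 'imd': [6808, 6808]})

# With a .gz suffix, compression='infer' (the default) picks gzip
# automatically for both to_pickle and read_pickle.
path = os.path.join(tempfile.mkdtemp(), 'NSPL_demo.pkl.gz')
df.to_pickle(path)
restored = pd.read_pickle(path)

print(df.equals(restored))  # True
```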