Here is a link to download a zip file containing around 1GB of postcode-level data published by the UK government's Office for National Statistics: https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data
Information on the data can be found here: http://geoportal.statistics.gov.uk/datasets/19fac93960554b5e90840505bd73917f
I have used this data in a data science application in Python, loading it into a Pandas dataframe. I have integrated it into a simple web page and am deploying it to the cloud. I do not want to include the large data file in my repository, which I access from an AWS EC2 instance. As I understand it, I have two options:
1) Include the zipped file in the repository and read the CSV into a Pandas dataframe.
2) Open the URL, stream in the file, extract it in the script, and then read the CSV into a Pandas dataframe.
The issue with both approaches is that the zip file contains other contents besides the CSV file I need, and I'm not sure how to access that one file specifically.
Another approach I considered was compressing just the individual CSV I need before including it in the repository, but this seems to generate superfluous files:
('Multiple files found in compressed zip file %s', "['NSPCL_AUG19_UK_LU.csv', '__MACOSX/', '__MACOSX/._NSPCL_AUG19_UK_LU.csv']")
so I have the same issue with not being able to point directly to the file I need.
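For reference, the error above comes from pointing Pandas straight at the zip, roughly like this (the zip filename is just what I produced locally, so treat it as illustrative):
import pandas as pd

# read_csv with compression='zip' only accepts an archive with a single member,
# so the '__MACOSX' metadata entries trigger the "Multiple files found" error above.
df = pd.read_csv('NSPCL_AUG19_UK_LU.zip', compression='zip')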
Please let me know what best practice is and how to get the file I need into a Pandas dataframe.
You could do the following to extract the main CSV file from the archive and load it directly into a dataframe:
from urllib.request import urlopen
import io
import zipfile

import pandas as pd

url = 'https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data'
archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read()))  # Downloads the whole archive into memory; takes some time
csv_path = 'Data/NSPL_AUG_2019_UK.csv'  # The desired CSV file inside the archive
df = pd.read_csv(io.BytesIO(archive.read(csv_path)))  # Takes some time
>>> df
pcd pcd2 pcds ... imd calncv stp
0 AB1 0AA AB1 0AA AB1 0AA ... 6808 S99999999 S99999999
1 AB1 0AB AB1 0AB AB1 0AB ... 6808 S99999999 S99999999
2 AB1 0AD AB1 0AD AB1 0AD ... 6808 S99999999 S99999999
3 AB1 0AE AB1 0AE AB1 0AE ... 5503 S99999999 S99999999
4 AB1 0AF AB1 0AF AB1 0AF ... 6668 S99999999 S99999999
... ... ... ... ... ... ... ...
2632799 ZE3 9JW ZE3 9JW ZE3 9JW ... 4187 S99999999 S99999999
2632800 ZE3 9JX ZE3 9JX ZE3 9JX ... 4187 S99999999 S99999999
2632801 ZE3 9JY ZE3 9JY ZE3 9JY ... 4187 S99999999 S99999999
2632802 ZE3 9JZ ZE3 9JZ ZE3 9JZ ... 4187 S99999999 S99999999
2632803 ZE3 9XP ZE3 9XP ZE3 9XP ... 4187 S99999999 S99999999
[2632804 rows x 41 columns]
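If you are not sure of the exact member path (or it changes in a future release), you can list the archive's contents and pick out the CSV you need while skipping the macOS metadata entries. A minimal sketch, reusing the archive object above; the 'Data/' prefix and 'NSPL' filter are assumptions based on the current layout:
# List the members, keeping only real CSV files and ignoring '__MACOSX/' metadata
csv_members = [name for name in archive.namelist()
               if name.lower().endswith('.csv') and not name.startswith('__MACOSX/')]
print(csv_members)  # Inspect the candidates
csv_path = next(name for name in csv_members if name.startswith('Data/') and 'NSPL' in name)
df = pd.read_csv(io.BytesIO(archive.read(csv_path)))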
Now you could store the data frame on its own in a compressed file, as suggested by iliar:
df.to_pickle('NSPL_AUG_2019_UK.pkl', compression='gzip') # Takes some more time
When I tried this, the resulting file size was around 69MB, so you shouldn't have any trouble hosting it in the cloud.
To access the dataframe again, simply load the pickled file, passing the same compression (it won't be inferred from the plain .pkl extension):
>>> df = pd.read_pickle('NSPL_AUG_2019_UK.pkl', compression='gzip')
>>> df
pcd pcd2 pcds ... imd calncv stp
0 AB1 0AA AB1 0AA AB1 0AA ... 6808 S99999999 S99999999
1 AB1 0AB AB1 0AB AB1 0AB ... 6808 S99999999 S99999999
2 AB1 0AD AB1 0AD AB1 0AD ... 6808 S99999999 S99999999
3 AB1 0AE AB1 0AE AB1 0AE ... 5503 S99999999 S99999999
4 AB1 0AF AB1 0AF AB1 0AF ... 6668 S99999999 S99999999
... ... ... ... ... ... ... ...
2632799 ZE3 9JW ZE3 9JW ZE3 9JW ... 4187 S99999999 S99999999
2632800 ZE3 9JX ZE3 9JX ZE3 9JX ... 4187 S99999999 S99999999
2632801 ZE3 9JY ZE3 9JY ZE3 9JY ... 4187 S99999999 S99999999
2632802 ZE3 9JZ ZE3 9JZ ZE3 9JZ ... 4187 S99999999 S99999999
2632803 ZE3 9XP ZE3 9XP ZE3 9XP ... 4187 S99999999 S99999999
[2632804 rows x 41 columns]
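For deployment, one option (just a sketch; the cache filename and function name are made up) is to rebuild the compressed pickle from the ONS URL only when it isn't already present, so your EC2 instance downloads the 1GB archive at most once and the raw data never has to live in the repository:
import os
import io
import zipfile
from urllib.request import urlopen

import pandas as pd

URL = 'https://www.arcgis.com/sharing/rest/content/items/19fac93960554b5e90840505bd73917f/data'
CSV_PATH = 'Data/NSPL_AUG_2019_UK.csv'   # Member inside the archive
CACHE = 'NSPL_AUG_2019_UK.pkl'           # Local compressed cache (hypothetical name)

def load_postcodes():
    # Reuse the small pickled copy if it already exists on the instance.
    if os.path.exists(CACHE):
        return pd.read_pickle(CACHE, compression='gzip')
    # Otherwise download the full archive, extract the CSV, and cache it.
    archive = zipfile.ZipFile(io.BytesIO(urlopen(URL).read()))
    df = pd.read_csv(io.BytesIO(archive.read(CSV_PATH)))
    df.to_pickle(CACHE, compression='gzip')
    return df

df = load_postcodes()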