For my university assignment, I have to produce a CSV file with the distances between all the airports of the world. The problem is that my CSV file weighs 151 MB, and I want to reduce its size as much as I can. This is my CSV:
and this is my code:
# drop all features we don't need
for attribute in df:
    if attribute not in ('NAME', 'COUNTRY', 'IATA', 'LAT', 'LNG'):
        df = df.drop(attribute, axis=1)
# create a dictionary of airports, each airport has the following structure:
airport_dict = {}
for airport in df.itertuples():
    airport_dict[airport[3]] = (airport[1], airport[2], airport[4], airport[5])
# From tutorial 4 solution:
for i, airport_code1 in enumerate(airportcodes):
    airport1 = airport_dict[airport_code1]
    for j, airport_code2 in enumerate(airportcodes):
        if j > i:
            airport2 = airport_dict[airport_code2]
            # little edit: no need to calculate the distance twice, all duplicates are set to 0 distance
# set all 0 distance values to NaN
airportdists = airportdists.replace(0, np.nan)
airportdists.to_csv(r'../Project Data Files-20190322/distances.csv')
I also tried re-indexing it before saving:
# remove all NaN values
airportdists = airportdists.stack().reset_index()
airportdists.columns = ['airport1','airport2','distance']
but the result is a dataframe with 3 columns and 17 million rows, and a disk size of 419 MB... not exactly an improvement...
Can you help me shrink the size of my csv? Thank you!
I have built a similar application in the past; here's what I would do:
It is difficult to shrink your file, but if your application needs, for example, the distances from one airport to all the others, I suggest creating 9541 files: each file holds the distances from one airport to all the others and is named after that airport.
That way, loading a single file is really fast.
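A minimal sketch of that idea, under a few assumptions: `airportdists` is the square DataFrame from the question, with the same airport codes in the same order as both index and columns, and NaN wherever a distance was not computed. The function names `write_per_airport_files` and `load_distances`, and the `airport`/`distance` column labels, are made up for illustration.

```python
from pathlib import Path

import numpy as np
import pandas as pd


def write_per_airport_files(dists, out_dir):
    """Write one CSV per airport, named <code>.csv (hypothetical helper).

    Assumes `dists` is square, with identical codes in identical order
    on the index and the columns, and NaN for distances not computed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    values = dists.to_numpy(dtype=float)
    # Only one triangle was computed, so fill each missing cell
    # from the mirrored cell of the transposed matrix.
    full = np.where(np.isnan(values), values.T, values)

    codes = list(dists.index)
    for i, code in enumerate(codes):
        # dropna() removes the airport's (NaN) distance to itself.
        row = pd.Series(full[i], index=codes, name="distance").dropna()
        row.to_csv(out_dir / f"{code}.csv", index_label="airport", header=True)
    return len(codes)


def load_distances(code, out_dir):
    """Read back the distances for a single airport (hypothetical helper)."""
    path = Path(out_dir) / f"{code}.csv"
    return pd.read_csv(path, index_col="airport")["distance"]
```

With this layout you would call `write_per_airport_files(airportdists, "distances")` once and then `load_distances(code, "distances")` for whichever airport code you need. The total size on disk does not get smaller (every distance is now stored twice), but each lookup only touches one small file instead of the whole matrix.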