I want to extract monthly temperature data for different locations from several netCDF files. The files are structured as follows:
> print(data.variables.keys())
dict_keys(['lon', 'lat', 'time', 'tmp','stn'])
The files have names like "tmp_1901_1910.nc".
Here is the code I use:
import glob
import pandas as pd
import os
import numpy as np
import time
from netCDF4 import Dataset
os.chdir('PATH/data_tmp')
all_years = []
for file in glob.glob('*.nc'):
    data = Dataset(file,'r')
    time_data = data.variables['time'][:]
    time = data.variables['time']
    # File names look like tmp_1901_1910.nc, so [4:13] is '1901_1910'
    year = str(file)[4:13]
    all_years.append(year)
# Empty pandas dataframe
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start = str(year_start[0:4]) + '-01-01', end = str(end_year[5:9]) + '-12-31', freq ='M')
df = pd.DataFrame(0.0, columns = ['Temp'], index = date_range)
# Defining the location, lat, lon based on the csv data
cities = pd.read_csv(r'PATH/cities_coordinates.csv', sep =',')
cities['city']= cities['city'].map(str)
for index, row in cities.iterrows():
    location = row['code_nbs']
    location_latitude = row['lat']
    location_longitude = row['lon']
    # Sorting the list
    all_years.sort()
    for yr in all_years:
        # Reading in the data
        data = Dataset('tmp_'+str(yr)+'.nc','r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        # Squared difference between the specified lat, lon and the lat, lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        # Retrieving the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the temperature data
        tmp = data.variables['tmp']
        start = str(yr[0:4])+'-01-01'
        end = str(yr[5:11])+'-12-31'
        d_range = pd.date_range(start = start, end = end, freq='M')
        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for: '+str(d_range[t_index]))
            df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]
    df.to_csv(location +'.csv')
I get the following error when running the line df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]:
IndexError: index exceeds dimension bounds
Inspecting the values of the objects involved gives:
print(d_range)
DatetimeIndex(['1901-01-31', '1901-02-28', '1901-03-31', '1901-04-30',
               '1901-05-31', '1901-06-30', '1901-07-31', '1901-08-31',
               '1901-09-30', '1901-10-31',
               ...
               '1910-03-31', '1910-04-30', '1910-05-31', '1910-06-30',
               '1910-07-31', '1910-08-31', '1910-09-30', '1910-10-31',
               '1910-11-30', '1910-12-31'],
              dtype='datetime64[ns]', length=120, freq='M')
On the first t_index within the loop, I have:
print(t_index)
0
print(d_range[t_index])
1901-01-31 00:00:00
print(min_index_lat)
259
print(min_index_lon)
592
I don't understand what went wrong with the dimensions.
Thank you for any help!
I assume you want to read in all the .nc data and map each location to its closest city. For that, I suggest reading all the data first and only afterwards working out which city each location belongs to. The following code will probably need some adaptation to your data, but it should show a direction that makes the code more robust.
Step 1: read in all the data, e.g. into one or more DataFrames. Whether a single DataFrame works depends on whether you can load all the data at once. If not, split steps 1 and 2 into chunks.
import glob
import pandas as pd
from netCDF4 import Dataset

df_list = []
for file in glob.glob('*.nc'):
    data = Dataset(file,'r')
    # Columns follow the names listed by data.variables.keys().
    # Note: pandas needs every column to have the same length, so
    # gridded variables have to be flattened to matching 1-D arrays first.
    df_i = pd.DataFrame({
        'time': data.variables['time'][:],
        'lat': data.variables['lat'][:],
        'lon': data.variables['lon'][:],
        'tmp': data.variables['tmp'][:],
        'stn': data.variables['stn'][:],
        'year': str(file)[4:13],  # maybe not needed as 'time' should have this info already, and [4:13] needs exactly this format
        'file_name': file,  # to track back the file
        # ... and more
    })
    df_list.append(df_i)
df = pd.concat(df_list, ignore_index=True)
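One adaptation the snippet above will likely need: `pd.DataFrame` requires all columns to have the same length, whereas a gridded `tmp` variable is three-dimensional and its coordinate variables are one-dimensional. Below is a minimal sketch of flattening such a grid into one row per cell. It assumes `tmp` is stored as (time, lat, lon), which you should check against your files. `grid_to_long` is a hypothetical helper name, and the small arrays are synthetic stand-ins for the real file contents.

```python
import numpy as np
import pandas as pd


def grid_to_long(time, lat, lon, tmp):
    """Flatten a (time, lat, lon) grid into one row per grid cell and time step."""
    tmp = np.asarray(tmp)
    expected = (len(time), len(lat), len(lon))
    if tmp.shape != expected:
        raise ValueError(f"tmp has shape {tmp.shape}, expected {expected}")
    # indexing='ij' keeps the axis order (time, lat, lon), so the
    # flattened coordinates line up with tmp.ravel().
    t_grid, lat_grid, lon_grid = np.meshgrid(time, lat, lon, indexing='ij')
    return pd.DataFrame({
        'time': t_grid.ravel(),
        'lat': lat_grid.ravel(),
        'lon': lon_grid.ravel(),
        'tmp': tmp.ravel(),
    })


# Synthetic stand-in for one file: 2 time steps, 3 latitudes, 2 longitudes.
time = np.array([0, 1])
lat = np.array([10.0, 20.0, 30.0])
lon = np.array([100.0, 110.0])
tmp = np.arange(12, dtype=float).reshape(2, 3, 2)

long_df = grid_to_long(time, lat, lon, tmp)
```

With real files the arrays would come from `data.variables[...][:]`. netCDF4 often returns masked arrays, so missing values may need filling first (for example with `np.ma.filled(values, np.nan)`). A full-resolution grid flattened this way can get large, which is where reading in chunks comes in.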
Step 2: map each location to its closest city, e.g. with groupby, although there are several other ways to do it. Depending on the amount of data, I suggest using pandas or numpy routines rather than plain Python loops, as they are much faster.
df['city'] = None
gp = df.groupby(['lon', 'lat'])
for values_i, indexes_i in gp.groups.items():
    # Add your code to get the closest city
    # values_i[0] is 'lon'
    # values_i[1] is 'lat'
    # e.g.:
    diff_lon_lat = np.hypot(cities['lon']-values_i[0], cities['lat']-values_i[1])
    # idxmin() returns the index label, which is what .loc expects
    location = cities.loc[diff_lon_lat.idxmin(), 'code_nbs']
    # and add the parameters to the df
    df.loc[indexes_i, 'city'] = location