python · netcdf · dimensions · netcdf4

Trouble with dimensions in netCDF: index exceeds dimension bounds


I want to extract monthly temperature data from several netCDF files in different locations. Files are built as follows:

> print(data.variables.keys())
dict_keys(['lon', 'lat', 'time', 'tmp','stn'])

Files have names like "tmp_1901_1910.nc".

Here is the code I use:

import glob
import pandas as pd
import os
import numpy as np
import time
from netCDF4 import Dataset


os.chdir('PATH/data_tmp')
all_years = []

for file in glob.glob('*.nc'):
    data = Dataset(file,'r')
    time_data = data.variables['time'][:]
    time = data.variables['time']
    year =  str(file)[4:13]

    all_years.append(year)
   
# Empty pandas dataframe
year_start = min(all_years)
end_year = max(all_years)

date_range = pd.date_range(start = str(year_start[0:4]) + '-01-01', end = str(end_year[5:9]) + '-12-31', freq ='M')

df = pd.DataFrame(0.0, columns = ['Temp'], index = date_range)


# Defining the location, lat, lon based on the csv data 
cities = pd.read_csv(r'PATH/cities_coordinates.csv', sep =',')


cities['city']= cities['city'].map(str)


for index, row in cities.iterrows():
    location = row['code_nbs']
    location_latitude = row['lat']
    location_longitude = row['lon']
     
    # Sorting the list
    all_years.sort()
    
    for yr in all_years:
        #Reading in the data
        data = Dataset('tmp_'+str(yr)+'.nc','r')
        
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
    
        # Squared difference between the specified lat, lon and the lat, lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        
        
        # Retrieving the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
            
        # Accessing the temperature data
        tmp  = data.variables['tmp']
        
        start = str(yr[0:4])+'-01-01'
        end = str(yr[5:11])+'-12-31'
        d_range = pd.date_range(start = start, end = end, freq='M')
        
        for t_index in np.arange(0, len(d_range)):
             print('Recording the value for: '+str(d_range[t_index]))
             df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]
           
    df.to_csv(location +'.csv')

I get the following error when running the line df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]:

IndexError: index exceeds dimension bounds

I inspected the objects' values and got:

print(d_range)
DatetimeIndex(['1901-01-31', '1901-02-28', '1901-03-31', '1901-04-30',
               '1901-05-31', '1901-06-30', '1901-07-31', '1901-08-31',
               '1901-09-30', '1901-10-31',
               ...
               '1910-03-31', '1910-04-30', '1910-05-31', '1910-06-30',
               '1910-07-31', '1910-08-31', '1910-09-30', '1910-10-31',
               '1910-11-30', '1910-12-31'],
              dtype='datetime64[ns]', length=120, freq='M')

On the first t_index within the loop, I have:

print(t_index)
0

print(d_range[t_index])
1901-01-31 00:00:00

print(min_index_lat)
259
print(min_index_lon)
592
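
For reference, the dimension order and sizes of tmp (which determine the index order that tmp[...] expects) can be inspected directly on the variable:

print(data.variables['tmp'].dimensions)
print(data.variables['tmp'].shape)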

I don't understand what went wrong with the dimensions.

Thank you for any help!


Solution

  • I assume you want to read in all the .nc data and then map the closest city to each location. For that, I suggest reading all the data first and afterwards working out which city each location belongs to. The following code will probably need some adaptation to your data, but it should show a direction you could take to make the code more robust.

    Step 1: Import your 'raw' data

    e.g. into one or more DataFrames, depending on whether you can import all the data at once. If not, split steps 1 and 2 into chunks.

    df_list = []
    for file in glob.glob('*.nc'):
        data = Dataset(file, 'r')
        df_i = pd.DataFrame({
            'time': data.variables['time'][:],
            'lat': data.variables['lat'][:],
            'lon': data.variables['lon'][:],
            # NOTE: multi-dimensional variables (e.g. 'tmp' over time/lat/lon) may need
            # to be flattened/reshaped before they fit into a flat DataFrame column
            'tmp': data.variables['tmp'][:],
            'stn': data.variables['stn'][:],
            'year': str(file)[4:13],  # maybe not needed as 'time' should have this info already, and [4:13] needs exactly this format
            'file_name': file,  # to track back the file
            # ... and more
            })

        df_list.append(df_i)

    df = pd.concat(df_list, ignore_index=True)


    Step 2: Map the locations

    e.g. with groupby, but there are several other methods. Depending on the amount of data, I suggest using pandas or numpy routines over plain Python loops; they are much faster.

    df['city'] = None
    gp = df.groupby(['lon', 'lat'])
    for values_i, indexes_i in gp.groups.items():
        # Add your code to get the closest city
        # values_i[0] is 'lon'
        # values_i[1] is 'lat'
        
        # e.g.:
        diff_lon_lat = np.hypot(cities['lon']-values_i[0], cities['lat']-values_i[1])
        location = cities.loc[diff_lon_lat.idxmin(), 'code_nbs']  # idxmin returns the row label of the closest city
        
        # and add the parameters to the df
        df.loc[indexes_i, 'city'] = location
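
    As a possible final step (a sketch, not part of the code above, and assuming df now holds 'time', 'tmp' and 'city' columns), the per-city CSV files from the original approach could then be written with another groupby instead of the outer loop:

    for city_i, df_city in df.groupby('city'):
        # one file per mapped location, analogous to the original df.to_csv(location + '.csv')
        df_city.to_csv(str(city_i) + '.csv', index=False)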