Tags: python, pandas, csv, netcdf, netcdf4

Convert time series data from CSV to netCDF in Python


The main problem in this process is the line below:

precip[:] = orig

It produces this error:

ValueError: cannot reshape array of size 5732784 into shape (39811,144,144)

I have two CSV files. One contains the actual data for a variable (precipitation), with one column per station; the corresponding station coordinates are in a second, separate CSV file. My sample data is on Google Drive here.

In case you want to look at the data itself: the first CSV file has shape (39811, 144) and the second has shape (171, 10), but note that I'm only using a slice of the second dataframe with shape (144, 2).

This is the code:

stations = pd.read_csv(stn_precip)
stncoords = stations.iloc[:,[0,1]][:144]
orig = pd.read_csv(orig_precip, skiprows = 1, names = stations['Code'][:144])

lons = stncoords['X']
lats = stncoords['Y']

ncout = netCDF4.Dataset('Precip_1910-2018_homomod.nc', 'w')

ncout.createDimension('longitude',lons.shape[0])
ncout.createDimension('latitude',lats.shape[0])
ncout.createDimension('precip',orig.shape[1])
ncout.createDimension('time',orig.shape[0])

lons_out = lons.tolist()
lats_out = lats.tolist()
time_out = orig.index.tolist()

lats = ncout.createVariable('latitude',np.dtype('float32').char,('latitude',))
lons = ncout.createVariable('longitude',np.dtype('float32').char,('longitude',))
time = ncout.createVariable('time',np.dtype('float32').char,('time',))
precip = ncout.createVariable('precip',np.dtype('float32').char,('time', 'longitude','latitude'))

lats[:] = lats_out
lons[:] = lons_out
time[:] = time_out
precip[:] = orig
ncout.close()

I'm mostly basing my code on this post: convert-csv-to-netcdf. However, it does not include 'TIME' as a third dimension, and that is where I'm failing. I expected the precipitation variable to have shape (39811, 144, 144), but the error suggests otherwise.

I'm not exactly sure how to deal with this; any input is appreciated.
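The element count in the error message can be checked directly. The sketch below is only an illustration with NumPy: it uses small stand-in sizes (5 time steps, 3 stations, both made up) instead of the real 39811 and 144, and it reproduces the same kind of reshape failure that occurs when a (time, station) table is assigned to a (time, longitude, latitude) variable.

```python
import numpy as np

# Small stand-ins for the real sizes (39811 time steps, 144 stations).
n_time, n_station = 5, 3

# The CSV data: one row per time step, one column per station.
data = np.zeros((n_time, n_station))

# Shape implied by the dimensions ('time', 'longitude', 'latitude')
# when longitude and latitude each have one entry per station.
grid_shape = (n_time, n_station, n_station)

try:
    data.reshape(grid_shape)
    reshape_ok = True
    message = ""
except ValueError as err:
    # 15 values cannot fill a 5 x 3 x 3 array.
    reshape_ok = False
    message = str(err)

# The same arithmetic with the real sizes from the question.
real_size = 39811 * 144        # 5732784, the size in the error message
grid_size = 39811 * 144 * 144  # number of values a full grid would need

print(reshape_ok, message)
print(real_size, grid_size)
```

In other words, the table holds one value per time step and station, while the declared variable expects one value per time step and per longitude/latitude pair.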


Solution

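  • The error occurs because precip is declared with the dimensions (time, longitude, latitude), i.e. shape (39811, 144, 144), while orig holds only 39811 × 144 = 5732784 values, one per time step and station. Since the data come from individual stations rather than a regular grid, I would suggest using a single station dimension in the netCDF file instead of separate longitude and latitude dimensions. The longitude and latitude of each station can still be saved as separate variables along that dimension.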

    Here is one possible solution, using your code as an example:

    #!/usr/bin/env ipython
    import pandas as pd
    import numpy as np
    import netCDF4
    
    stn_precip='Precip_1910-2018_stations.csv'
    orig_precip='Precip_1910-2018_origvals.csv'
    stations = pd.read_csv(stn_precip)
    stncoords = stations.iloc[:,[0,1]][:144]
    orig = pd.read_csv(orig_precip, skiprows = 1, names = stations['Code'][:144])
    
    lons = stncoords['X']
    lats = stncoords['Y']
    nstations = np.size(lons)
    
    ncout = netCDF4.Dataset('Precip_1910-2018_homomod.nc', 'w')
    
    ncout.createDimension('station',nstations)
    ncout.createDimension('time',orig.shape[0])
    
    lons_out = lons.tolist()
    lats_out = lats.tolist()
    time_out = orig.index.tolist()
    
    lats = ncout.createVariable('latitude',np.dtype('float32').char,('station',))
    lons = ncout.createVariable('longitude',np.dtype('float32').char,('station',))
    time = ncout.createVariable('time',np.dtype('float32').char,('time',))
    precip = ncout.createVariable('precip',np.dtype('float32').char,('time', 'station'))
    
    lats[:] = lats_out
    lons[:] = lons_out
    time[:] = time_out
    precip[:] = orig.values  # plain (time, station) array, matching the variable's dimensions
    ncout.close()
    

    The header of the output file can then be inspected with ncdump -h Precip_1910-2018_homomod.nc.