Tags: python, python-xarray, hdf

Xarray: Loading several CSV files into a dataset


I have several comma-separated data files that I want to load into an xarray dataset. Each row in each file represents a different spatial value of a field on a fixed grid, and each file represents a different point in time. The grid is fixed in time, but its spacing is not uniform. The ultimate goal is to compute max_{x, y} { std_t[ value(x, y, t) * sqrt(x**2 + y**2) ] }, where sqrt is the square root, std_t is the standard deviation with respect to time, and max_{x, y} is the maximum across all space.
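To make the target quantity concrete, here is a minimal sketch of how I would compute it if the data were already in a single xarray DataArray with dimensions (time, x, y); the toy array below is only a stand-in for the real data:

import numpy as np
import xarray as xr

# toy stand-in for value(x, y, t) on a fixed but non-uniform grid
x = np.array([0.0, 0.5, 1.5, 3.0])
y = np.array([0.0, 1.0, 2.5])
t = np.arange(5)
rng = np.random.default_rng(0)
value = xr.DataArray(rng.normal(size=(t.size, x.size, y.size)),
                     coords={'time': t, 'x': x, 'y': y},
                     dims=('time', 'x', 'y'), name='value')

# weight by distance from the origin, take the standard deviation over time,
# then the maximum over all spatial points
weighted = value * np.sqrt(value.x ** 2 + value.y ** 2)
target = weighted.std('time').max(['x', 'y'])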

I am having trouble loading the data. It is not clear to me how one is supposed to load several CSV files into an xarray dataset. There is an open_mfdataset function, which is designed for loading several data files into a dataset, but seems to expect hdf5 or netcdf files.

It seems there is no way to load plain CSV files directly into an xarray dataset, so some preprocessing of the data is necessary. In my example, I decided to convert the csv files to hdf5 files beforehand, in order to use the h5netcdf engine. This has created what appears to be an hdf5-specific problem for me.
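For what it is worth, I suspect the root of the hdf5 issue is that pandas' to_hdf writes a PyTables store rather than the netCDF-flavoured HDF5 layout that the h5netcdf engine expects. Below is a minimal sketch of the kind of conversion I think would be netCDF-friendly, using a toy DataFrame in place of one of my real files; I am not sure this is the idiomatic route, which is part of why I am asking:

import pandas as pd
import xarray as xr

# toy stand-in for one of the csv files: columns time, x1, x2, value
df = pd.DataFrame({'time':  [0, 0, 0, 0],
                   'x1':    [0, 0, 1, 1],
                   'x2':    [0, 1, 0, 1],
                   'value': [0.0, -1.0, 1.0, 0.0]})

# promote the coordinate columns to the index, convert to an xarray Dataset,
# and write netCDF4/HDF5 that the h5netcdf engine can read back
ds = df.set_index(['time', 'x1', 'x2']).to_xarray()
ds.to_netcdf('exampleFile0.nc', engine='h5netcdf')
print(xr.open_dataset('exampleFile0.nc', engine='h5netcdf'))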

Below is my best attempt at loading the data so far. Unfortunately, it results in an empty xarray dataset. I tried several options for open_mfdataset; the following code is only one of several attempts at using the function.

How can I load these csv files into a single xarray dataset, to set myself up to find the maximum across space of the standard deviation in time of the value of interest?

import xarray as xr
import numpy as np
import pandas as pd

'''
Create example files
- Each file contains a spatially dependent value, f(x, y)
- Each file represents a different point in time, f(x, y, t)

'''
for ii in range(7):

   # create csv file
   fl = open('exampleFile%i.dat' % ii, 'w')
   fl.write('time x1 x2 value\n')
   for xx in range(10):
      for yy in range(10):
         fl.write('%i %i %i %i\n' % 
                  (ii, xx, yy, (xx - yy) * np.exp(ii)))
   fl.close()

   # convert the whitespace-delimited csv to hdf5
   dat = pd.read_csv('exampleFile%i.dat' % ii, sep=r'\s+')
   dat.to_hdf('exampleFile%i.hdf5' % ii, key='data', mode='w')

'''
Read all files into an xarray dataset
   (the ultimate goal is to find the
      maximum across space of
      the standard deviation across time
      of the "value" column)
'''
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')

When I run this code, the result variable does not contain the desired data:

In: result
Out: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    PYTABLES_FORMAT_VERSION:  2.1
    TITLE:                    Empty(dtype=dtype('S1'))
    VERSION:                  1.0

Edit

An answer was posted that assumes a uniformly spaced spatial grid. Here is a slightly modified example that does not assume an evenly-spaced grid of spatial points.

The example also uses three spatial dimensions, which is closer to my real problem; I realized that might be an important detail, even in this simple example.

import xarray as xr
import numpy as np
import pandas as pd

'''
Create example files
- Each file contains a spatially dependent value, f(x, y, z)
- Each file represents a different point in time, f(x, y, z, t)

'''
for ii in range(7):

   # create csv file
   fl = open('exampleFile%i.dat' % ii, 'w')
   fl.write('time x y z value\n')
   for xx in range(10):
      for yy in range(int(10 + xx // 2)):
         for zz in range(int(10 + xx //3 + yy // 3)):
            fl.write('%i %f %f %f %f\n' % 
                    (ii, xx * np.exp(- 1 * yy * zz) , yy * np.exp(xx - zz), zz * np.exp(xx * yy), (xx - yy) * np.exp(ii)))
   fl.close()

   # convert the whitespace-delimited csv to hdf5
   dat = pd.read_csv('exampleFile%i.dat' % ii, sep=r'\s+')
   dat.to_hdf('exampleFile%i.hdf5' % ii, key='data', mode='w')

'''
Read all files into an xarray dataset
   (the ultimate goal is to find the
      maximum across space of
      the standard deviation across time
      of the "value" column)
'''
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')

Solution

  • I hope I understood your problem. See if this works for you.

    When defining the keyword arguments for read_csv, note that it is better to use delim_whitespace=True (or sep=r'\s+', which newer pandas versions prefer) instead of sep=" ". This avoids creating extra columns if there happen to be double spaces somewhere in the file.

    I am telling read_csv that time, x, y and z are the index columns, so they become coordinates when each DataFrame is converted to xarray. That automatically structures your unstructured points onto a labelled grid and fills the holes with NaN. Then I concatenate all of the xarray objects into a single object along time.

    from glob import glob

    import pandas as pd
    import xarray as xr

    # collect the whitespace-delimited files and keep them in time order
    fnames = sorted(glob('*.dat'))

    # time, x, y and z become the index, and therefore coordinates after to_xarray()
    kw = dict(delim_whitespace=True, index_col=['time', 'x', 'y', 'z'])

    # read each file, convert it to an xarray object, and concatenate along time
    ds = xr.concat([pd.read_csv(fname, **kw).to_xarray() for fname in fnames], 'time')
    

    The final result is an xarray object like this:

    (screenshot of the repr of the resulting xarray object)

    Now you can do everything with this object.

    ds.std('time').max(['x','y','z']) will return the maximum across space of the standard deviation in time for all variables (in this case only the value column), which is the statistic you described. The NaN fill values are skipped by default for float data, but keep an eye on the skipna argument if your results unexpectedly come out as NaN. A sketch of the fully weighted version of your statistic follows at the end of this answer.

    Please let me know if that solves your problem; I would be glad to adapt it if it does not tackle some specific issue you are having with your data.
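
    For completeness, here is a minimal sketch of the full statistic from your original question (standard deviation over time first, then the maximum over space), using the 2-D weight sqrt(x**2 + y**2) from your first example as a stand-in; swap in whatever radial weight matches your real geometry:

    import numpy as np

    # weight every sample by its distance from the origin in the x-y plane
    # (this is the sqrt(x**2 + y**2) factor stated in the original 2-D goal)
    weighted = ds['value'] * np.sqrt(ds['x'] ** 2 + ds['y'] ** 2)

    # standard deviation over time, then the maximum over all spatial points;
    # the NaN fill values from the ragged grid are skipped by default for floats
    result = weighted.std('time').max(['x', 'y', 'z'])
    print(float(result))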