# Xarray: Loading several CSV files into a dataset

I have several comma-separated data files that I want to load into an xarray dataset. Each row in each file represents a different spatial value of a field on a fixed grid, and every file represents a different point in time. The grid is fixed in time, but its spacing is not uniform. The ultimate goal is to compute `max_{x, y} { std_t[ value(x, y, t) * sqrt(y ** 2 + x ** 2)] }`, where `sqrt` is the square root, `std_t` is the standard deviation with respect to time and `max_{x, y}` is the maximum across all space.

I am having trouble loading the data. It is not clear to me how one is supposed to load several CSV files into an xarray dataset. There is an `open_mfdataset` function, which is designed for loading several data files into a dataset, but it seems to expect HDF5 or netCDF files.

It seems like there is no way to load regular CSV files into an xarray dataset, and that preprocessing the data is necessary. In my example, I decided to convert the CSV files to HDF5 files beforehand, to make use of the `h5netcdf` engine. This has created what appears to be an HDF5-specific problem for me.

Below is my best attempt at loading the data so far. Unfortunately, it results in an empty xarray dataset. I tried several options in the `open_mfdataset` function, and the following code is only one of several attempts at using the function.

How can I load these csv files into a single xarray dataset, to set myself up to find the maximum across space of the standard deviation in time of the value of interest?

```
import xarray as xr
import numpy as np
import pandas as pd
'''
Create example files
- Each file contains a spatial-dependent value, f(x, y)
- Each file represents a different point in time, f(x, y, t)
'''
for ii in range(7):
    # create csv file
    fl = open('exampleFile%i.dat' % ii, 'w')
    fl.write('time x1 x2 value\n')
    for xx in range(10):
        for yy in range(10):
            fl.write('%i %i %i %i\n' %
                     (ii, xx, yy, (xx - yy) * np.exp(ii)))
    fl.close()
    # convert csv to hdf5
    dat = pd.read_csv('exampleFile%i.dat' % ii)
    dat.to_hdf('exampleFile%i.hdf5' % ii, 'data', mode='w')
'''
Read all files into xarray dataframe
(the ultimate goal is to find the
maximum across time of
the standard deviation across space
of the "value" column)
'''
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')
```

When I run the code, the `result` variable does not appear to contain the desired data:

```
In: result
Out:
<xarray.Dataset>
Dimensions: ()
Data variables:
*empty*
Attributes:
PYTABLES_FORMAT_VERSION: 2.1
TITLE: Empty(dtype=dtype('S1'))
VERSION: 1.0
```

**Edit**

An answer was posted that assumes a uniformly spaced spatial grid. Here is a slightly modified example that does not assume an evenly-spaced grid of spatial points.

The example also assumes three spatial dimensions. That is more true to my real problem, and I realized that might be an important detail in this simple example.

```
import xarray as xr
import numpy as np
import pandas as pd
'''
Create example files
- Each file contains a spatial-dependent value, f(x, y)
- Each file represents a different point in time, f(x, y, t)
'''
for ii in range(7):
    # create csv file
    fl = open('exampleFile%i.dat' % ii, 'w')
    fl.write('time x y z value\n')
    for xx in range(10):
        for yy in range(int(10 + xx // 2)):
            for zz in range(int(10 + xx // 3 + yy // 3)):
                fl.write('%i %f %f %f %f\n' %
                         (ii, xx * np.exp(- 1 * yy * zz), yy * np.exp(xx - zz), zz * np.exp(xx * yy), (xx - yy) * np.exp(ii)))
    fl.close()
    # convert csv to hdf5
    dat = pd.read_csv('exampleFile%i.dat' % ii)
    dat.to_hdf('exampleFile%i.hdf5' % ii, 'data', mode='w')
'''
Read all files into xarray dataframe
(the ultimate goal is to find the
maximum across time of
the standard deviation across space
of the "value" column)
'''
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')
```

## Solution

I hope I understood your problem. See if this works for you.

When defining the keyword arguments for `read_csv`, note that it is better to use `delim_whitespace=True` instead of `sep=" "`. This avoids spurious extra columns wherever the file contains two or more consecutive spaces. (In pandas 2.2 and later `delim_whitespace` is deprecated; `sep=r"\s+"` is the equivalent.)

I am telling `read_csv`, via `index_col`, that `time`, `x`, `y` and `z` are all coordinates, and then converting the result to `xarray`. The conversion automatically arranges the unstructured data on a grid and fills the holes with `NaN`. Then I concatenate all the `xarray` objects into a single object along `time`.

```
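from glob import glob

import pandas as pd
import xarray as xr

fnames = sorted(glob('*.dat'))  # sort so the files are concatenated in time order
# sep=r'\s+' splits on any run of whitespace (same effect as delim_whitespace=True)
kw = dict(sep=r'\s+', index_col=['time', 'x', 'y', 'z'])
ds = xr.concat([pd.read_csv(fname, **kw).to_xarray() for fname in fnames], 'time')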
```
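To see the `NaN` filling in action, here is a minimal, self-contained sketch. It assumes two tiny made-up files (`demo0.dat` and `demo1.dat`, written to a temporary directory) whose points only partly overlap. The file names and values are purely illustrative. I pass `join="outer"` explicitly so that coordinates missing from one file are kept and padded, rather than relying on the default.

```python
import os
import tempfile

import pandas as pd
import xarray as xr

# Hypothetical data: two time steps whose x coordinates only partly overlap.
rows = {
    0: [(0, 0.0, 0.0, 0.0, 1.0), (0, 1.0, 0.5, 0.0, 2.0)],
    1: [(1, 0.0, 0.0, 0.0, 3.0), (1, 2.0, 0.5, 0.0, 4.0)],
}

tmpdir = tempfile.mkdtemp()
fnames = []
for t, records in rows.items():
    fname = os.path.join(tmpdir, "demo%i.dat" % t)
    with open(fname, "w") as fl:
        fl.write("time x y z value\n")
        for rec in records:
            fl.write("%i %f %f %f %f\n" % rec)
    fnames.append(fname)

# Same recipe as above: the index columns become the dataset coordinates.
kw = dict(sep=r"\s+", index_col=["time", "x", "y", "z"])
pieces = [pd.read_csv(f, **kw).to_xarray() for f in fnames]

# join="outer" keeps the union of all x/y/z values and pads gaps with NaN.
demo = xr.concat(pieces, "time", join="outer")

print(dict(demo.sizes))
print(int(demo["value"].isnull().sum()), "cells were filled with NaN")
```

Keep in mind that the grid is the outer product of all distinct `x`, `y` and `z` values, so for strongly scattered points it can become much larger than the number of rows in the files.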

The final result is a single `xarray` `Dataset` with the dimensions `time`, `x`, `y` and `z`. Now you can do everything with this object.

`ds.max(['x','y','z']).std('time')` will return the standard deviation in time of the spatial maximum value for all variables (in this case it is only the `value` column). Beware that sometimes you may have to pass `skipna=True` to avoid having `NaN` outputs from your analysis.
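As I read the question, the quantity you are after is the other way around: the maximum over space of the standard deviation in time, with `value` first multiplied by `sqrt(x ** 2 + y ** 2)`. A minimal sketch of that calculation is below. It assumes a small synthetic dataset (`ds_demo`, a hypothetical stand-in for the concatenated `ds`) with only `x` and `y` as spatial dimensions.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the concatenated dataset: 3 time steps on a 2 x 2 grid.
values = np.arange(12, dtype=float).reshape(3, 2, 2)
ds_demo = xr.Dataset(
    {"value": (("time", "x", "y"), values)},
    coords={"time": [0, 1, 2], "x": [1.0, 3.0], "y": [0.0, 4.0]},
)

# Distance from the origin; the 1-D coordinates broadcast to an (x, y) array.
radius = np.sqrt(ds_demo["x"] ** 2 + ds_demo["y"] ** 2)

# Weight the field, take the standard deviation in time, then the spatial maximum.
weighted = ds_demo["value"] * radius
std_t = weighted.std("time")   # population standard deviation (ddof=0) by default
result = std_t.max()           # reduces over all remaining (spatial) dimensions

print(float(result))
```

With the real data you would replace `ds_demo` with `ds` and, if `z` should enter the distance, add `ds["z"] ** 2` under the square root.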

Please let me know if that solves your problem. I would be glad to adapt it if it does not tackle some specific issue you are having with your data.
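If the points are too scattered for a dense `x`/`y`/`z` grid to be practical, one possible alternative is to keep a single `point` dimension and attach `x`, `y` and `z` as coordinates along it. The sketch below assumes every file lists the same points in the same row order, which the question says holds because the grid is fixed in time. The helper `frames_to_point_array` and the synthetic frames are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def frames_to_point_array(frames):
    """Stack per-time DataFrames (same points, same row order) into (time, point)."""
    first = frames[0]
    values = np.stack([df["value"].to_numpy() for df in frames])
    times = [df["time"].iloc[0] for df in frames]
    return xr.DataArray(
        values,
        dims=("time", "point"),
        coords={
            "time": times,
            # Spatial positions are ordinary coordinates along "point", not grid axes.
            "x": ("point", first["x"].to_numpy()),
            "y": ("point", first["y"].to_numpy()),
            "z": ("point", first["z"].to_numpy()),
        },
        name="value",
    )


# Synthetic frames standing in for [pd.read_csv(f, sep=r"\s+") for f in fnames].
xyz = pd.DataFrame({"x": [0.0, 0.3, 3.0], "y": [0.0, 0.4, 4.0], "z": [1.0, 2.0, 5.0]})
frames = [xyz.assign(time=t, value=[1.0 * t, 2.0 * t, 3.0 * t]) for t in range(4)]

da = frames_to_point_array(frames)
weighted = da * np.sqrt(da["x"] ** 2 + da["y"] ** 2)
result = weighted.std("time").max("point")
print(float(result))
```

This stores exactly one number per point and time step, so no `NaN` padding is needed. The trade-off is that grid-style selection such as `sel(x=...)` is not available on the non-index coordinates.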
