I currently have a 4D dataset ds
in xarray that looks like this:
<xarray.Dataset>
Dimensions: (lat: 60, lon: 78, time: 216, pres: 395)
Coordinates:
* lat (lat) float32 0.5 1.5 2.5 3.5 4.5 5.5 ... 55.5 56.5 57.5 58.5 59.5
* lon (lon) float32 -45.5 -44.5 -43.5 -42.5 ... -69.5 -75.5 -74.5 -76.5
* time (time) float32 7.32e+05 7.32e+05 7.32e+05 ... 7.385e+05 7.385e+05
* pres (pres) float64 2.5 7.5 12.5 17.5 ... 1.962e+03 1.968e+03 1.972e+03
Data variables:
var (pres, lat, lon, time) float64 2.03e+03 2.03e+03 ... nan nan>
My goal is to turn it into a pandas df that looks like this:
id time pres param 20.5-70.5 20.5-71.5 20.5-72.5
0 0 0 var 2085 2073 2057
1 0 1 var 2114 2156 2054
2 0 2 var 2039 2006 2179
3 1 0 var 2199 2144 2033
4 1 1 var 2056 2102 2191
5 1 2 var 2062 2033 2052
6 2 0 var 2001 2153 2170
7 2 1 var 2187 2120 2100
8 2 2 var 2138 2076 2002
where I have some sort of multi-index with time
and pres
, a param
column (because I will likely have more than one variable at a time), and each pixel (so paired lat-lon) as a column header, so that for each pixel column I have the var
value corresponding to the time
and pres
. I need this format for the next part of my analysis that includes some vectorization.
I tried a few things, including stacked = ds.stack(coordinates=["lat", "lon"])
which I think is the start of what I want to do, to then do stacked.to_dataframe()
but then the latter unstacks my paired coordinates. I think I'm missing something here but I'm not too sure how to go about this?
Any help is greatly appreciated!
Thanks
(Since you don't provide a sample, you should adapt this example)
Use stack/unstack
to reshape your dataset:
import xarray as xr
import pandas as pd
ds = xr.tutorial.load_dataset('air_temperature')
df = ds.to_dataframe().rename_axis(columns='param').stack('param').unstack(['lat', 'lon'])
df.columns = [f"{lat}-{lon}" for lat, lon in df.columns]
Output:
>>> df
75.0-200.0 75.0-202.5 75.0-205.0 75.0-207.5 ... 15.0-322.5 15.0-325.0 15.0-327.5 15.0-330.0
time param ...
2013-01-01 00:00:00 air 241.199997 242.500000 243.500000 244.000000 ... 297.600006 296.899994 296.790009 296.600006
2013-01-01 06:00:00 air 242.099991 242.699997 243.099991 243.389999 ... 296.899994 296.399994 296.399994 296.600006
2013-01-01 12:00:00 air 242.299988 242.199997 242.299988 242.500000 ... 297.600006 297.000000 297.000000 296.790009
2013-01-01 18:00:00 air 241.889999 241.799988 241.799988 242.099991 ... 298.199982 297.790009 298.000000 297.899994
2013-01-02 00:00:00 air 243.199997 243.099991 243.099991 243.299988 ... 297.699982 297.100006 297.399994 297.399994
... ... ... ... ... ... ... ... ... ...
2014-12-30 18:00:00 air 243.089996 243.389999 243.689987 243.789993 ... 297.989990 297.389984 296.889984 296.089996
2014-12-31 00:00:00 air 242.489990 242.389999 242.189987 241.689987 ... 297.290009 296.589996 295.989990 295.489990
2014-12-31 06:00:00 air 243.489990 242.989990 242.089996 240.689987 ... 297.089996 296.089996 295.790009 295.790009
2014-12-31 12:00:00 air 245.789993 244.789993 243.489990 241.889999 ... 296.589996 295.690002 295.489990 295.190002
2014-12-31 18:00:00 air 245.089996 244.289993 243.289993 242.189987 ... 297.190002 296.489990 296.190002 295.690002
[2920 rows x 1325 columns]
>>> ds
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7