I have 7 2D cloud-optimised GeoTIFFs stacked into one DataArray in xarray. They are very large, so I am using the intake-xarray plugin and dask to stream the data from S3 lazily rather than loading it all into RAM. I have concatenated them along their "band" dimension to stack them:
import intake
import xarray as xr

# load the intake catalog
catalog = intake.open_catalog("s3://example-data/datasets.yml")

datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']

# open each layer lazily as a dask-backed DataArray and stack along 'band'
to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)

merged = xr.concat(to_concat, dim='band')
merged.coords['band'] = datasets  # add labels to the band dimension

y_array = catalog["NASA_global"].to_dask()
y_array.coords['band'] = 'NASA_global'
merged
<xarray.DataArray (band: 7, y: 225000, x: 450000)>
dask.array<concatenate, shape=(7, 225000, 450000), dtype=float32, chunksize=(1, 256, 256), chunktype=numpy.ndarray>
Coordinates:
  * band     (band) <U31 'dem' ... 'worldcover'
  * y        (y) float64 90.0 90.0 90.0 90.0 90.0 ... -90.0 -90.0 -90.0 -90.0
  * x        (x) float64 -180.0 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0 180.0
Attributes:
    transform:      (0.0008, 0.0, -180.0, 0.0, -0.0008, 90.0)
    crs:            +init=epsg:4326
    res:            (0.0008, 0.0008)
    is_tiled:       1
    nodatavals:     (32767.0,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Area
My question is: how can I now convert the data into several 1D columns, equivalent to flattening a 2D array in numpy? I have looked at .squeeze() to remove dimensions, but can't get it into the desired format. I want to do some machine learning and need the data as a table with one row per pixel and one column per band. I'm new to dask and xarray.
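To make the target shape concrete, here is roughly the numpy equivalent of what I'm after (toy sizes, not my real data):

import numpy as np

arr = np.arange(7 * 3 * 4).reshape(7, 3, 4)  # (band, y, x)
flat = arr.reshape(7, -1).T                  # (pixels, bands) -> one column per band
print(flat.shape)                            # (12, 7)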
I'd really appreciate any help or advice.
For anyone interested, I worked out how to do it in xarray, but it blew up the memory of my instance:
import intake
import xarray as xr

# load the intake catalog
catalog = intake.open_catalog("s3://example-data/datasets.yml")

datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']

to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)

# define X and y, clipped to a region of interest
merged = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
merged.coords['band'] = datasets
X_array = merged  # still a lazy DataArray at this point
y_array = catalog["NASA_global"].to_dask()
y_array.coords['band'] = 'NASA_global'

# reshape: collapse the spatial dims into a single 'z' dim, one column per band
X_temp = X_array.stack(z=('x', 'y'))
X = X_temp.transpose('z', 'band').values
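Even before the reshape runs, a quick size check on the merged DataArray above (no computation needed, nbytes is just size times itemsize) shows why this was never going to fit:

# rough size of the clipped array if it were fully materialised as numpy
print(merged.nbytes / 1e9, "GB")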
Calling .values at the end loads the whole stacked array into a single in-memory numpy array and kills the instance. A colleague worked out a better solution that keeps everything as lazy dask arrays and doesn't eat memory:
import intake
import xarray as xr
import dask.dataframe as dd

catalog = intake.open_catalog("s3://example-data/datasets.yml")

datasets = ['dem',
            'dem_slope',
            'dem_slope_aspect',
            'distance_railways',
            'distance_river',
            'distance_road',
            'worldcover']

to_concat = []
for data in datasets:
    x = catalog[data].to_dask()
    to_concat.append(x)

# define X and y, still backed by lazy dask arrays
X_array = xr.concat(to_concat, dim='band').sel(x=slice(-124, -66), y=slice(50, 24))
X_array.coords['band'] = datasets
y_array = catalog["NASA_global"].to_dask()

# reshape the underlying dask arrays to (band, pixel) without computing anything
X_table = X_array.data.reshape((7, -1), merge_chunks=True)
y_table = y_array.data.reshape((1, -1), merge_chunks=True)

# one row per pixel, one column per band
X = dd.from_dask_array(X_table.T, columns=datasets)
y = dd.from_dask_array(y_table.T, columns=['NASA_global'])
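X and y are now lazy dask DataFrames with one row per pixel and one column per band, so nothing is streamed from S3 until you actually compute something. For example, peeking at the first few rows only pulls the first partition:

print(X.head())  # computes just the first partition
print(y.head())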