I have a 1D array of independent variable values (`x_array`) that matches the time-steps in a 3D numpy array of spatial data with multiple time-steps (`y_array`). My actual data is much larger: 300+ time-steps and up to 3000 x 3000 pixels:
import numpy as np
from scipy.stats import linregress
# Independent variable: four time-steps of 1-dimensional data
x_array = np.array([0.5, 0.2, 0.4, 0.4])
# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2, -0.2, -0.3],
                     [-0.3, -0.2, -0.3],
                     [-0.3, -0.4, -0.4]],
                    [[-0.2, -0.2, -0.4],
                     [-0.3, np.nan, -0.3],
                     [-0.3, -0.3, -0.4]],
                    [[np.nan, np.nan, -0.3],
                     [-0.2, -0.3, -0.7],
                     [-0.3, -0.3, -0.3]],
                    [[-0.1, -0.3, np.nan],
                     [-0.2, -0.3, np.nan],
                     [-0.1, np.nan, np.nan]]])
I want to compute a per-pixel linear regression and obtain R-squared, P-values, intercepts and slopes for each xy pixel in `y_array`, using the values for each time-step in `x_array` as my independent variable.
I can reshape to get the data into a form I can input into `np.polyfit`, which is vectorised and fast:
# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)
# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)
However, this ignores pixels that contain any `NaN` values (`np.polyfit` does not support `NaN`), and does not calculate the statistics I require (R-squared, P-values, intercepts and slopes).
The answer here uses `scipy.stats.linregress`, which does calculate the statistics I need, and suggests avoiding the `NaN` issues by masking out these `NaN` values. However, that example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case, where each column in `y_array_reshaped` will have a different set of `NaN` values.
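For reference, a naive but self-contained baseline (my own sketch, not the linked answer's code) applies `linregress` pixel-by-pixel with a per-column mask. It is far too slow for 300 x 3000 x 3000 data, but is handy for spot-checking a vectorised implementation on a small subset; the synthetic data below is a hypothetical stand-in for my real arrays:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical small example: 10 time-steps of 3 x 3 spatial data with a
# known linear relationship (slope 2) plus noise, and random NaNs
rng = np.random.default_rng(42)
x = rng.random(10)
y = 2.0 * x[:, None, None] + rng.normal(0, 0.1, (10, 3, 3))
y[rng.random(y.shape) < 0.2] = np.nan

# Naive per-pixel loop: mask NaNs independently for each pixel
y_flat = y.reshape(len(y), -1)
r2 = np.full(y_flat.shape[1], np.nan)
for i in range(y_flat.shape[1]):
    mask = ~np.isnan(y_flat[:, i])
    if mask.sum() > 2:  # need at least 3 valid points for a fit
        fit = linregress(x[mask], y_flat[mask, i])
        r2[i] = fit.rvalue ** 2
r2 = r2.reshape(y.shape[1:])  # back to the (3, 3) spatial grid
```
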
My question: How can I calculate regression statistics for each pixel in a large multidimensional array (300 x 3000 x 3000) containing many `NaN` values in a reasonably fast, vectorised way?

Desired result: A 3 x 3 array of regression statistic values (e.g. R-squared) for each pixel in `y_array`, even if that pixel contains `NaN` values at some point in the time series.
This blog post mentioned in the comments above contains an incredibly fast vectorised function for cross-correlation, covariance, and regression for multi-dimensional data in Python. It produces all of the regression outputs I need, and does so in milliseconds as it relies entirely on simple vectorised array operations in `xarray`.
May 2024 update: I've updated the function to ensure it correctly accounts for different numbers of `NaN` values in each pixel, and to correctly calculate P-values for different alternative hypotheses (thanks big-zhao!):
import numpy as np
import xarray as xr
from scipy.stats import t


def xr_regression(x, y, lag_x=0, lag_y=0, dim="time", alternative="two-sided"):
    """
    Takes two xr.DataArrays of any dimensions (input data could be a 1D
    time series, or for example, have three dimensions e.g. time, lat,
    lon), and returns covariance, correlation, coefficient of
    determination, regression slope, intercept, p-value and standard
    error, and number of valid observations (n) between the two datasets
    along their aligned first dimension.

    Datasets can be provided in any order, but note that the regression
    slope and intercept will be calculated for y with respect to x.

    Inspired by:
    https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

    Parameters
    ----------
    x, y : xarray DataArray
        Two xarray DataArrays with any number of dimensions, both
        sharing the same first dimension
    lag_x, lag_y : int, optional
        Optional integers giving lag values to assign to either of the
        data, with lag_x shifting x, and lag_y shifting y by the
        specified lag amount.
    dim : str, optional
        An optional string giving the name of the dimension on which to
        align (and optionally lag) datasets. The default is 'time'.
    alternative : string, optional
        Defines the alternative hypothesis. Default is 'two-sided'.
        The following options are available:

        * 'two-sided': slope of the regression line is nonzero
        * 'less': slope of the regression line is less than zero
        * 'greater': slope of the regression line is greater than zero

    Returns
    -------
    regression_ds : xarray.Dataset
        A dataset comparing the two input datasets along their aligned
        dimension, containing variables including covariance, correlation,
        coefficient of determination, regression slope, intercept,
        p-value and standard error, and number of valid observations (n).
    """
    # Shift x and y data if lags are specified
    if lag_x != 0:
        # If x lags y by 1, x must be shifted 1 step backwards. But as
        # the 'zero-th' value is nonexistent, xarray assigns it as
        # invalid (nan). Hence it needs to be dropped
        x = x.shift(**{dim: -lag_x}).dropna(dim=dim)

        # Next re-align the two datasets so that y adjusts to the
        # changed coordinates of x
        x, y = xr.align(x, y)

    if lag_y != 0:
        y = y.shift(**{dim: -lag_y}).dropna(dim=dim)

    # Ensure that the data are properly aligned to each other
    x, y = xr.align(x, y)

    # Compute data length, mean and standard deviation along dim
    n = y.notnull().sum(dim=dim)
    xmean = x.mean(dim=dim)
    ymean = y.mean(dim=dim)
    xstd = x.std(dim=dim)
    ystd = y.std(dim=dim)

    # Compute covariance, correlation and coefficient of determination
    cov = ((x - xmean) * (y - ymean)).sum(dim=dim) / n
    cor = cov / (xstd * ystd)
    r2 = cor**2

    # Compute regression slope and intercept
    slope = cov / (xstd**2)
    intercept = ymean - xmean * slope

    # Compute t-statistics and standard error
    tstats = cor * np.sqrt(n - 2) / np.sqrt(1 - cor**2)
    stderr = slope / tstats

    # Calculate p-values for different alternative hypotheses
    if alternative == "two-sided":
        pval = t.sf(np.abs(tstats), n - 2) * 2
    elif alternative == "greater":
        pval = t.sf(tstats, n - 2)
    elif alternative == "less":
        pval = t.cdf(tstats, n - 2)

    # Wrap p-values into an xr.DataArray
    pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

    # Combine into single dataset
    regression_ds = xr.merge(
        [
            cov.rename("cov").astype(np.float32),
            cor.rename("cor").astype(np.float32),
            r2.rename("r2").astype(np.float32),
            slope.rename("slope").astype(np.float32),
            intercept.rename("intercept").astype(np.float32),
            pval.rename("pvalue").astype(np.float32),
            stderr.rename("stderr").astype(np.float32),
            n.rename("n").astype(np.int16),
        ]
    )

    return regression_ds
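To apply the function to the question's example data, wrap the arrays as `xr.DataArray`s sharing a `time` dimension and call `xr_regression(x, y)`. The sketch below inlines the core of that computation (no lags, two-sided p-values) so it runs standalone; the dimension names `"y"` and `"x"` are my own arbitrary labels:

```python
import numpy as np
import xarray as xr
from scipy.stats import t

x_array = np.array([0.5, 0.2, 0.4, 0.4])
y_array = np.array([[[-0.2, -0.2, -0.3], [-0.3, -0.2, -0.3], [-0.3, -0.4, -0.4]],
                    [[-0.2, -0.2, -0.4], [-0.3, np.nan, -0.3], [-0.3, -0.3, -0.4]],
                    [[np.nan, np.nan, -0.3], [-0.2, -0.3, -0.7], [-0.3, -0.3, -0.3]],
                    [[-0.1, -0.3, np.nan], [-0.2, -0.3, np.nan], [-0.1, np.nan, np.nan]]])

# Wrap as DataArrays sharing a 'time' dimension
x = xr.DataArray(x_array, dims=["time"])
y = xr.DataArray(y_array, dims=["time", "y", "x"])

# Core of xr_regression, inlined (no lags, two-sided p-values); xarray
# reductions skip NaNs by default, so each pixel uses its own valid obs
n = y.notnull().sum(dim="time")
cov = ((x - x.mean("time")) * (y - y.mean("time"))).sum("time") / n
cor = cov / (x.std("time") * y.std("time"))
r2 = cor**2
slope = cov / x.std("time") ** 2
intercept = y.mean("time") - x.mean("time") * slope
tstats = cor * np.sqrt(n - 2) / np.sqrt(1 - cor**2)
pval = t.sf(np.abs(tstats), n - 2) * 2

print(r2.shape)  # (3, 3): one R-squared value per pixel
```

Every statistic comes out as a 3 x 3 array, one value per pixel, including for pixels with `NaN` values at some time-steps.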