Tags: python, arrays, numpy, scipy, python-xarray

How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?


I have a 1D array of independent variable values (x_array) that match the time-steps in a 3D numpy array of spatial data with multiple time-steps (y_array). My actual data is much larger: 300+ time-steps and up to 3000 x 3000 pixels:

import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

I want to compute a per-pixel linear regression and obtain the R-squared, P-value, intercept and slope for each xy pixel in y_array, using the values at each time-step in x_array as my independent variable.

I can reshape the data into a form that can be passed to np.polyfit, which is vectorised and fast:

# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)

However, this ignores pixels that contain any NaN values (np.polyfit does not support NaN values), and does not calculate the statistics I require (R-squared, P-values, intercepts and slopes).

The answer here uses scipy.stats.linregress, which does calculate the statistics I need, and suggests avoiding NaN issues by masking out these NaN values. However, that example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case, where each column in y_array_reshaped has a different set of NaN values.
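
For reference, applying that masking approach pixel by pixel in a plain Python loop does work (a rough sketch below; n_pixels and r_squared are just illustrative names), but a loop like this is far too slow at the scale of my real data:

n_pixels = y_array_reshaped.shape[1]
r_squared = np.full(n_pixels, np.nan)

# Mask the NaN values separately for every pixel (column)
for i in range(n_pixels):
    y_pixel = y_array_reshaped[:, i]
    mask = ~np.isnan(y_pixel)
    if mask.sum() >= 2:  # linregress needs at least two valid points
        result = linregress(x_array[mask], y_pixel[mask])
        r_squared[i] = result.rvalue ** 2

# Reshape back to the original 3 x 3 spatial grid
r_squared = r_squared.reshape(y_array.shape[1:])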

My question: How can I calculate regression statistics for each pixel in a large multidimensional array (300 x 3000 x 3000) containing many NaN values in a reasonably fast, vectorised way?

Desired result: A 3 x 3 array of regression statistic values (e.g. R-squared) for each pixel in y_array, even if that pixel contains NaN values at some point in the time series.


Solution

  • This blog post mentioned in the comments above contains an incredibly fast vectorised function for cross-correlation, covariance, and regression for multi-dimensional data in Python. It produces all of the regression outputs I need, and does so in milliseconds as it relies entirely on simple vectorised array operations in xarray.

    May 2024 update: I've updated the function to ensure it correctly accounts for different numbers of NaN values in each pixel, and to correctly calculate P-values for different alternative hypotheses (thanks big-zhao!):

    # Imports required by the function below
    import numpy as np
    import xarray as xr
    from scipy.stats import t

    def xr_regression(x, y, lag_x=0, lag_y=0, dim="time", alternative="two-sided"):
        """
        Takes two xr.DataArrays of any dimensions (input data could be a
        1D time series or, for example, have three dimensions such as
        time, lat, lon), and returns covariance, correlation, coefficient of
        determination, regression slope, intercept, p-value and standard
        error, and number of valid observations (n) between the two datasets
        along their aligned first dimension.
    
        Datasets can be provided in any order, but note that the regression
        slope and intercept will be calculated for y with respect to x.
    
        Inspired by:
        https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html
    
        Parameters
        ----------
        x, y : xarray DataArray
            Two xarray DataArrays with any number of dimensions, both
            sharing the same first dimension
        lag_x, lag_y : int, optional
            Optional integers giving lag values to assign to either of the
            data, with lag_x shifting x and lag_y shifting y by the
            specified lag amount.
        dim : str, optional
            An optional string giving the name of the dimension on which to
            align (and optionally lag) datasets. The default is 'time'.
        alternative : string, optional
            Defines the alternative hypothesis. Default is 'two-sided'.
            The following options are available:
    
            * 'two-sided': slope of the regression line is nonzero
            * 'less': slope of the regression line is less than zero
            * 'greater': slope of the regression line is greater than zero
    
        Returns
        -------
        regression_ds : xarray.Dataset
            A dataset comparing the two input datasets along their aligned
            dimension, containing variables including covariance, correlation,
            coefficient of determination, regression slope, intercept,
            p-value and standard error, and number of valid observations (n).
    
        """
    
        # Shift x and y data if lags are specified
        if lag_x != 0:
            # If x lags y by 1, x must be shifted 1 step backwards. But as
            # the 'zero-th' value is nonexistent, xarray assigns it as
            # invalid (NaN), so it needs to be dropped.
            x = x.shift(**{dim: -lag_x}).dropna(dim=dim)
    
            # Next re-align the two datasets so that y adjusts to the
            # changed coordinates of x
            x, y = xr.align(x, y)
    
        if lag_y != 0:
            y = y.shift(**{dim: -lag_y}).dropna(dim=dim)
    
        # Ensure that the data are properly aligned to each other.
        x, y = xr.align(x, y)
    
        # Compute data length, mean and standard deviation along dim
        n = y.notnull().sum(dim=dim)
        xmean = x.mean(dim=dim)
        ymean = y.mean(dim=dim)
        xstd = x.std(dim=dim)
        ystd = y.std(dim=dim)
    
        # Compute covariance, correlation and coefficient of determination
        cov = ((x - xmean) * (y - ymean)).sum(dim=dim) / n
        cor = cov / (xstd * ystd)
        r2 = cor**2
    
        # Compute regression slope and intercept
        slope = cov / (xstd**2)
        intercept = ymean - xmean * slope
    
        # Compute t-statistics and standard error
        tstats = cor * np.sqrt(n - 2) / np.sqrt(1 - cor**2)
        stderr = slope / tstats
    
        # Calculate p-values for different alternative hypotheses.
        if alternative == "two-sided":
            pval = t.sf(np.abs(tstats), n - 2) * 2
        elif alternative == "greater":
            pval = t.sf(tstats, n - 2)
        elif alternative == "less":
            # One-sided 'less' uses the signed t-statistic (matching
            # scipy.stats.linregress), not its absolute value
            pval = t.cdf(tstats, n - 2)
        else:
            raise ValueError(f"Unknown alternative: {alternative}")

        # Wrap p-values into an xr.DataArray
        pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)
    
        # Combine into single dataset
        regression_ds = xr.merge(
            [
                cov.rename("cov").astype(np.float32),
                cor.rename("cor").astype(np.float32),
                r2.rename("r2").astype(np.float32),
                slope.rename("slope").astype(np.float32),
                intercept.rename("intercept").astype(np.float32),
                pval.rename("pvalue").astype(np.float32),
                stderr.rename("stderr").astype(np.float32),
                n.rename("n").astype(np.int16),
            ]
        )
    
        return regression_ds
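
    To run this on the example data from the question, the numpy arrays just need to be wrapped as xr.DataArrays that share a time dimension (a quick sketch assuming the imports above; the "time", "y" and "x" dimension names are only illustrative):

    # Wrap the question's numpy arrays as DataArrays with named dims
    x_da = xr.DataArray(x_array, dims=["time"])
    y_da = xr.DataArray(y_array, dims=["time", "y", "x"])

    # Per-pixel regression statistics, returned as an xarray.Dataset
    regression_ds = xr_regression(x_da, y_da)

    # e.g. a 3 x 3 array of R-squared values, one per pixel
    print(regression_ds.r2.values)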