python scikit-learn scipy python-xarray

Masking NaN values from an xarray dataset for scikit-learn multiple linear regression, following scipy


I'm attempting to use LinearRegression from sklearn.linear_model to find the multiple linear regression coefficients for different variables at each latitude and longitude point along the time dimension, like so:

import numpy as np
from sklearn.linear_model import LinearRegression

for i in range(len(data.lat)):
    for j in range(len(data.lon)):
        # Stack the predictors as columns: X has shape (n_time, 2)
        X = np.column_stack((data.ivar1.values[:, i, j],
                             data.ivar2.values[:, i, j]))
        y = data.dvar.values[:, i, j]
        storage_dframe[i, j, :] = LinearRegression().fit(X, y).coef_

While this general form works, there are abundant NaN values in my data because it comes from real observations. I would rather not impute data if I can avoid it, so as to preserve whatever real relations there might be. Is it possible to copy the behavior of scipy.stats.linregress, where "Missing values are considered pair-wise: if a value is missing in x, the corresponding value in y is masked"? This feels like the best route; otherwise, could I add a conditional clause along the lines of

if data.ivar1[:, i, j].isnull() or data.ivar2[:, i, j].isnull() == True:
    storage_dframe[i, j, :] = np.nan
else:
    X = np.column_stack((data.ivar1.values[:, i, j],
                         data.ivar2.values[:, i, j]))
    storage_dframe[i, j, :] = LinearRegression().fit(X, data.dvar.values[:, i, j]).coef_

I've attempted essentially that, with no success. Please feel free to chime in!


Solution

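  • This boolean clause handles it. Each .isnull() result is reduced with .any(), so every term in the condition is a single True/False rather than an array of booleans, whose truth value is ambiguous: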

    if (data.isel(lat=i, lon=j).ivar1.isnull().any()
            or data.isel(lev=2, lat=i, lon=j).ivar2.isnull().any()
            or data.isel(lev=2, lat=i, lon=j).ivar3.isnull().any()
            or data.isel(lev=0, lat=i, lon=j).ivar4.isnull().any()
            or data2.isel(lat=i, lon=j).dvar.isnull().any()):
        storage_dframe[i, j, :] = np.nan
    else:
        storage_dframe[i, j, :] = LinearRegression(...)
    

    where ivarx is the xth independent variable and dvar is the dependent variable.
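
For the pairwise behavior the question asks about, one alternative to blanking an entire grid cell is to drop only the time steps where any variable is NaN and fit on the remaining rows. The following is a minimal sketch of that idea, not the approach taken in the answer above. The helper name fit_cell and its min_samples parameter are hypothetical, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_cell(predictors, target, min_samples=None):
    """Fit one grid cell, dropping time steps where any input is NaN.

    predictors: sequence of 1-D arrays, one per independent variable.
    target: 1-D array with the dependent variable.
    Returns the coefficients, or all-NaN if too few valid rows remain.
    """
    X = np.column_stack(predictors)      # shape (n_time, n_predictors)
    y = np.asarray(target, dtype=float)
    n_predictors = X.shape[1]
    if min_samples is None:
        # Assumed default: at least one row per coefficient plus the intercept
        min_samples = n_predictors + 1

    # Keep a time step only if every predictor and the target are finite
    valid = np.isfinite(X).all(axis=1) & np.isfinite(y)
    if valid.sum() < min_samples:
        return np.full(n_predictors, np.nan)
    return LinearRegression().fit(X[valid], y[valid]).coef_


# Synthetic, noise-free example: y = 2*x1 - 3*x2 + 1, with a few NaNs added
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 * x1 - 3.0 * x2 + 1.0
x1[3] = np.nan
x2[10] = np.nan
y[20] = np.nan

coefs = fit_cell([x1, x2], y)                        # fitted on the 47 complete rows
all_nan = fit_cell([np.full(50, np.nan), x2], y)     # no complete rows, so NaN coefficients
```

Inside the lat/lon loop this would be called as storage_dframe[i, j, :] = fit_cell([data.ivar1.values[:, i, j], data.ivar2.values[:, i, j]], data.dvar.values[:, i, j]). Note that this discards a whole time step when any one variable is missing, so cells with scattered gaps in different variables can lose many samples.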