I want to run an OLS regression using Python's statsmodels package, but my dataset contains NaNs. I know I can pass the missing='drop' option when fitting the OLS model, but then some of the results (fitted values or residuals) have a different length from the original y variable.
I have the following code as an example:
import numpy as np
import statsmodels.api as sm
yvars = np.array([1.0, 6.0, 3.0, 2.0, 8.0, 4.0, 5.0, 2.0, np.nan, 3.0])
xvars = np.array(
[
[1.0, 8.0],
[8.0, np.nan],
[np.nan, 3.0],
[3.0, 6.0],
[5.0, 3.0],
[2.0, 7.0],
[1.0, 3.0],
[2.0, 2.0],
[7.0, 9.0],
[3.0, 1.0],
]
)
res = sm.OLS(yvars, sm.add_constant(xvars), missing='drop').fit()
res.resid
The result is as follows:
array([-0.71907958, -1.9012464 , 1.78811122, 1.18983701, 2.63854267,
-1.45254075, -1.54362416])
My question: the result is an array of length 7 (after dropping NaNs), but yvars has length 10. How can I get residuals of the same length as yvars, with nan in every position where there is at least one NaN in either yvars or xvars?
Basically, the result I want to get is:
array([-0.71907958, nan , nan , -1.9012464 , 1.78811122, 1.18983701, 2.63854267,
-1.45254075, nan , -1.54362416])
That is too difficult to implement in statsmodels itself, so users need to handle it themselves.
Results attributes such as fittedvalues and resid cover only the sample actually used in estimation.
The predict method of the results instance preserves nans in the provided predict data exog array, but other methods and attributes do not.
results.predict(xvars_all)
One workaround:
Use a pandas DataFrame for the data.
Then, AFAIR, resid and fittedvalues of the results instance are pandas Series with the appropriate index.
This can then be used to add those to the original index or DataFrame.
That's what the predict method does.