I am reducing the number of features in a large sparse file by calculating p-values, but I get the error below. I have seen similar posts, but this code works with non-sparse input. Can you help, please? (I can upload the input file if needed.)

    import statsmodels.formula.api as sm

    def backwardElimination(x, Y, sl, columns):
        numVars = len(x[0])
        pvalue_removal_counter = 0

        for i in range(0, numVars):
            print(i, 'of', numVars)
            regressor_OLS = sm.OLS(Y, x).fit()
            maxVar = max(regressor_OLS.pvalues).astype(float)

            if maxVar > sl:
                for j in range(0, numVars - i):
                    if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                        x = np.delete(x, j, 1)
                        pvalue_removal_counter += 1
                        columns = np.delete(columns, j)

        regressor_OLS.summary()
        return x, columns

Output:

    0 of 1970
    1 of 1970
    2 of 1970
    Traceback (most recent call last):
      File "main.py", line 142, in <module>
        selected_columns)
      File "main.py", line 101, in backwardElimination
        if (regressor_OLS.pvalues[j].astype(float) == maxVar):
    IndexError: index 1967 is out of bounds for axis 0 with size 1967

Here is a fixed version.
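I made a number of changes:

- Import OLS from statsmodels.api instead of statsmodels.formula.api.
- Generate columns in the function.
- Use np.argmax to find the location of the maximum value.
- Use a boolean index to select columns. In pseudo-code it is like x[:, [True, False, True]], which keeps columns 0 and 2.

    import numpy as np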
    # Wrong import. Not using the formula interface, so using statsmodels.api
    import statsmodels.api as sm

    def backwardElimination(x, Y, sl):
        numVars = x.shape[1]  # variables in columns
        columns = np.arange(numVars)
        for i in range(0, numVars):
            print(i, 'of', numVars)
            regressor_OLS = sm.OLS(Y, x).fit()
            # Largest p-value of the current fit
            maxVar = np.max(regressor_OLS.pvalues)
            if maxVar > sl:
                # Use boolean selection
                retain = np.ones(x.shape[1], bool)
                drop = np.argmax(regressor_OLS.pvalues)
                # Drop the column with the highest p-value
                retain[drop] = False
                # Keep the x we wish to retain
                x = x[:, retain]
                # Also keep their column indices
                columns = columns[retain]
            else:
                # Exit early once every p-value is at or below sl
                break

        # Show the summary of the last fitted model
        print(regressor_OLS.summary())
        return x, columns

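The boolean selection used in the function can be seen in isolation in the minimal sketch below. The array and the p-values are made-up illustration data, not output from statsmodels.

```python
import numpy as np

# Hypothetical data: 2 rows, 3 columns, with a made-up p-value per column.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
pvalues = np.array([0.02, 0.70, 0.01])
columns = np.arange(x.shape[1])

# Start by retaining every column, then switch off the one
# with the largest p-value.
retain = np.ones(x.shape[1], dtype=bool)
drop = np.argmax(pvalues)  # index of the first maximum
retain[drop] = False

x_kept = x[:, retain]           # boolean mask applied to the column axis
columns_kept = columns[retain]  # original indices of the kept columns

print(drop)          # 1
print(columns_kept)  # [0 2]
print(x_kept.shape)  # (2, 2)
```

Because the same mask is applied to both x and columns, the two stay aligned no matter which column is dropped.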
You can test backwardElimination with random data:

    x = np.random.standard_normal((1000, 100))
    y = np.random.standard_normal(1000)
    backwardElimination(x, y, 0.1)

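As for why the original loop raises the IndexError: in the traceback, the pass with i = 2 loops j up to 1967 while only 1967 p-values remain, so three columns were removed in the first two passes. With the original code that can only happen when two columns share the maximum p-value in one pass. Duplicated columns in sparse data are one plausible source of such ties, but that part is a guess and has not been checked against the actual input. The sketch below replays the original inner loop on made-up p-values containing a tie; remove_all_max is a hypothetical helper and no regression is fitted.

```python
import numpy as np

def remove_all_max(pvalues, columns):
    """Replay the original inner loop: delete at every index whose
    p-value equals the maximum, without refitting in between."""
    max_p = max(pvalues)
    for j in range(len(pvalues)):
        if pvalues[j] == max_p:
            columns = np.delete(columns, j)
    return columns

# Made-up p-values with a tie for the maximum at positions 1 and 3.
pvalues = np.array([0.10, 0.90, 0.30, 0.90, 0.05])
columns = np.arange(5)

remaining = remove_all_max(pvalues, columns)
print(len(remaining))  # 3, so two columns were removed in a single pass
print(remaining)       # [0 2 3]
```

Two things go wrong here. The pass removes two columns, so the next pass has one column fewer than range(0, numVars - i) assumes and pvalues[j] eventually runs off the end. The second deletion also hits the wrong column: after index 1 is removed, index 3 refers to the column originally labelled 4, so column 3 (p-value 0.90) survives. Dropping exactly one column per fit with np.argmax avoids both problems.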