Tags: python, numpy, statsmodels, p-value, index-error

IndexError: index 1967 is out of bounds for axis 0 with size 1967


I am reducing the number of features in a large sparse file by calculating p-values, but I get this error. I have seen similar posts, but this code works with non-sparse input. Can you help, please? (I can upload the input file if needed.)

import numpy as np
import statsmodels.formula.api as sm

def backwardElimination(x, Y, sl, columns):
    numVars = len(x[0])
    pvalue_removal_counter = 0

    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)

        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
                    pvalue_removal_counter += 1
                    columns = np.delete(columns, j)

    regressor_OLS.summary()
    return x, columns

Output:

0 of 1970
1 of 1970
2 of 1970
Traceback (most recent call last):
  File "main.py", line 142, in <module>
    selected_columns)
  File "main.py", line 101, in backwardElimination
    if (regressor_OLS.pvalues[j].astype(float) == maxVar):
IndexError: index 1967 is out of bounds for axis 0 with size 1967

Solution

  • Here is a fixed version.

    First, the cause of the error: the inner loop bound numVars - i assumes that exactly one column is removed per pass of the outer loop. If several columns tie for the maximum p-value, more than one column is deleted in a single pass, so on a later iteration regressor_OLS.pvalues is shorter than the loop expects and pvalues[j] runs off its end.

    I made a number of changes:

    1. Import the correct OLS from statsmodels.api (not the formula interface)
    2. Generate the columns array inside the function
    3. Use np.argmax to find the location of the maximum p-value
    4. Use a boolean index to select columns. In pseudo-code it is like x[:, [True, False, True]], which keeps columns 0 and 2 (see the short sketch after the code below).
    5. Stop as soon as no remaining p-value exceeds sl.
    import numpy as np
    # Wrong import in the question: we are not using the formula interface,
    # so use statsmodels.api instead
    import statsmodels.api as sm

    def backwardElimination(x, Y, sl):
        numVars = x.shape[1]  # variables are in columns
        columns = np.arange(numVars)

        for i in range(0, numVars):
            print(i, 'of', numVars)
            regressor_OLS = sm.OLS(Y, x).fit()
            # Largest p-value among the remaining variables
            maxVar = regressor_OLS.pvalues.max()

            if maxVar > sl:
                # Use boolean selection
                retain = np.ones(x.shape[1], bool)
                drop = np.argmax(regressor_OLS.pvalues)
                # Drop the column with the highest p-value
                retain[drop] = False
                # Keep the columns we wish to retain
                x = x[:, retain]
                # Also keep their column indices
                columns = columns[retain]
            else:
                # Exit early if no remaining p-value exceeds sl
                break

        # Show the final summary
        print(regressor_OLS.summary())
        return x, columns


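    To see changes 3 and 4 in isolation, here is a small toy sketch (the array and the p-values are made up purely for illustration):

    import numpy as np

    pvalues = np.array([0.03, 0.60, 0.01])  # made-up p-values, one per column
    x = np.arange(12).reshape(4, 3)         # toy design matrix with 3 columns

    drop = np.argmax(pvalues)           # index of the largest p-value -> 1
    retain = np.ones(x.shape[1], bool)  # [True, True, True]
    retain[drop] = False                # [True, False, True]

    x_reduced = x[:, retain]            # keeps columns 0 and 2
    print(x_reduced.shape)              # (4, 2)
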
    You can test it with

    x = np.random.standard_normal((1000, 100))
    y = np.random.standard_normal(1000)
    backwardElimination(x, y, 0.1)
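
    The question mentions a large sparse input file. The boolean indexing above (x[:, retain]) is written for a dense NumPy array, so one simple route, sketched below under the assumption that the features are stored as a scipy.sparse matrix in .npz format and the target as a saved NumPy array (both file names are placeholders), is to densify before fitting. Be aware that converting a very large sparse matrix to dense can be memory-intensive.

    import numpy as np
    from scipy import sparse

    # Placeholder paths -- adjust to however your data is actually stored
    X_sparse = sparse.load_npz('features.npz')
    y = np.load('target.npy')

    # Densify so that OLS fitting and x[:, retain] work as in the function above
    X_dense = X_sparse.toarray()

    x_selected, kept_columns = backwardElimination(X_dense, y, 0.1)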