Tags: python, numpy, statsmodels, p-value, index-error

IndexError: index 1967 is out of bounds for axis 0 with size 1967


I am reducing the number of features in a large sparse file by calculating p-values, but I get this error. I have seen similar posts, but this code works with non-sparse input. Can you help, please? (I can upload the input file if needed.)

import numpy as np
import statsmodels.formula.api as sm

def backwardElimination(x, Y, sl, columns):
    numVars = len(x[0])
    pvalue_removal_counter = 0

    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)

        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
                    pvalue_removal_counter += 1
                    columns = np.delete(columns, j)

    regressor_OLS.summary()
    return x, columns

Output:

0 of 1970
1 of 1970
2 of 1970
Traceback (most recent call last):
  File "main.py", line 142, in <module>
    selected_columns)
  File "main.py", line 101, in backwardElimination
    if (regressor_OLS.pvalues[j].astype(float) == maxVar):
IndexError: index 1967 is out of bounds for axis 0 with size 1967

Solution

  • Here is a fixed version.

    First, the cause of the error: the inner loop bound numVars - i assumes that exactly one column is removed per pass of the outer loop. If several columns tie for the maximum p-value, more than one column is deleted in a single pass, so on a later iteration regressor_OLS.pvalues is shorter than the loop expects and pvalues[j] runs off its end.

    I made a number of changes:

    1. Import the correct OLS from statsmodels.api (not the formula interface)
    2. Generate the columns array inside the function
    3. Use np.argmax to find the location of the maximum p-value
    4. Use a boolean index to select columns. In pseudo-code it is like x[:, [True, False, True]], which keeps columns 0 and 2 (see the short sketch after the code below).
    5. Stop as soon as no remaining p-value exceeds sl.
    import numpy as np
    # Wrong import in the question: we are not using the formula interface,
    # so use statsmodels.api instead
    import statsmodels.api as sm

    def backwardElimination(x, Y, sl):
        numVars = x.shape[1]  # variables are in columns
        columns = np.arange(numVars)

        for i in range(0, numVars):
            print(i, 'of', numVars)
            regressor_OLS = sm.OLS(Y, x).fit()
            # Largest p-value among the remaining variables
            maxVar = regressor_OLS.pvalues.max()

            if maxVar > sl:
                # Use boolean selection
                retain = np.ones(x.shape[1], bool)
                drop = np.argmax(regressor_OLS.pvalues)
                # Drop the column with the highest p-value
                retain[drop] = False
                # Keep the columns we wish to retain
                x = x[:, retain]
                # Also keep their column indices
                columns = columns[retain]
            else:
                # Exit early if no remaining p-value exceeds sl
                break

        # Show the final summary
        print(regressor_OLS.summary())
        return x, columns


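    To see changes 3 and 4 in isolation, here is a small toy sketch (the array and the p-values are made up purely for illustration):

    import numpy as np

    pvalues = np.array([0.03, 0.60, 0.01])  # made-up p-values, one per column
    x = np.arange(12).reshape(4, 3)         # toy design matrix with 3 columns

    drop = np.argmax(pvalues)           # index of the largest p-value -> 1
    retain = np.ones(x.shape[1], bool)  # [True, True, True]
    retain[drop] = False                # [True, False, True]

    x_reduced = x[:, retain]            # keeps columns 0 and 2
    print(x_reduced.shape)              # (4, 2)
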
    You can test it with

    x = np.random.standard_normal((1000, 100))
    y = np.random.standard_normal(1000)
    backwardElimination(x, y, 0.1)
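
    The question mentions a large sparse input file. The boolean indexing above (x[:, retain]) is written for a dense NumPy array, so one simple route, sketched below under the assumption that the features are stored as a scipy.sparse matrix in .npz format and the target as a saved NumPy array (both file names are placeholders), is to densify before fitting. Be aware that converting a very large sparse matrix to dense can be memory-intensive.

    import numpy as np
    from scipy import sparse

    # Placeholder paths -- adjust to however your data is actually stored
    X_sparse = sparse.load_npz('features.npz')
    y = np.load('target.npy')

    # Densify so that OLS fitting and x[:, retain] work as in the function above
    X_dense = X_sparse.toarray()

    x_selected, kept_columns = backwardElimination(X_dense, y, 0.1)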