python · pandas · patsy · bonferroni

How to remove features from a regression model using Bonferroni correction results?


I implemented a regression model using

import statsmodels.formula.api as smf

formula = ("cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + "
           "risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + "
           "duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)")

model_a = smf.ols(formula=formula, data=train).fit()
model_a.summary()

After fitting the model, I ran a Bonferroni correction using

import statsmodels.stats.multitest as smt

smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni',
                  is_sorted=False, returnsorted=False)

And I get the following result:

(array([ True, False,  True,  True,  True,  True,  True, False,  True,
     True,  True, False,  True,  True,  True,  True, False, False,
    False, False,  True, False,  True,  True,  True,  True,  True,
     True,  True, False,  True,  True, False,  True,  True, False,
     True,  True,  True,  True,  True,  True,  True,  True, False,
     True,  True,  True, False, False, False, False, False, False,
     True,  True,  True,  True,  True, False,  True, False,  True,
    False,  True,  True,  True,  True]),
 array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
    5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
    4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
    5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
    1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
    8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
    7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
    1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
    1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
    2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
    5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
    1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
    1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
    1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
    1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
    4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
    7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
 0.0007540287301109894,
 0.0007352941176470588)
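
For reference, multipletests returns a 4-tuple: the rejection mask, the corrected p-values, and the corrected alpha levels for the Sidak and Bonferroni methods. A minimal sketch of unpacking it, assuming the same model_a and smt import as above:

# reject: boolean mask, True where a coefficient stays significant after correction
# pvals_corrected: Bonferroni-adjusted p-values, in the same order as model_a.pvalues
reject, pvals_corrected, alpha_sidak, alpha_bonf = smt.multipletests(
    model_a.pvalues, alpha=0.05, method='bonferroni')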

I want to use these arrays to drop the features flagged False from the training data and build a new, simplified DataFrame, train_simplified.

I'm using the following manual approach, but I want to know if there's a more efficient way to do it.

train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38, 
41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)

Solution

  • You could use pandas .loc to select only the columns whose mask entries are True.

    .loc[] is primarily label based, but may also be used with a boolean array.

    import numpy as np
    import pandas as pd

    # dummy data: one column per entry in the boolean array from smt.multipletests
    train = pd.DataFrame(np.random.rand(5, 68))
              0         1         2         3  ...        63        64        65        66        67
    0  0.637557  0.887213  0.472215  0.119594  ...  0.908266  0.239562  0.144895  0.489453  0.985650
    1  0.242055  0.672136  0.761620  0.237638  ...  0.649633  0.849223  0.657613  0.568309  0.093675
    2  0.367716  0.265202  0.243990  0.973011  ...  0.465598  0.542645  0.286541  0.590833  0.030500
    3  0.037348  0.822601  0.360191  0.127061  ...  0.070569  0.642419  0.026511  0.585776  0.940230
    4  0.575474  0.388170  0.643288  0.458253  ...  0.091206  0.494420  0.057559  0.549529  0.441531
    
    [5 rows x 68 columns]
    
    keep_columns = np.array([ # array from smt.multipletests
        True, False,  True,  True,  True,  True,  True, False,  True,
        True,  True, False,  True,  True,  True,  True, False, False,
        False, False,  True, False,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True, False,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True, False, False, False, False, False, False,
        True,  True,  True,  True,  True, False,  True, False,  True,
        False,  True,  True,  True,  True])
    np.sum(keep_columns) # 47 (keep 47 columns)
    
    train_simplified = train.loc[:, keep_columns]
    

    Output from train_simplified

              0         2         3         4  ...        62        64        65        66        67
    0  0.637557  0.472215  0.119594  0.713245  ...  0.278646  0.239562  0.144895  0.489453  0.985650
    1  0.242055  0.761620  0.237638  0.728216  ...  0.746491  0.849223  0.657613  0.568309  0.093675
    2  0.367716  0.243990  0.973011  0.393098  ...  0.035942  0.542645  0.286541  0.590833  0.030500
    3  0.037348  0.360191  0.127061  0.522243  ...  0.162934  0.642419  0.026511  0.585776  0.940230
    4  0.575474  0.643288  0.458253  0.545617  ...  0.789618  0.494420  0.057559  0.549529  0.441531
    
    [5 rows x 47 columns]
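
  • Note that the boolean array from multipletests has one entry per fitted coefficient (the intercept and every dummy-encoded level of the C(...) terms), not one per original column of train, so on the real data you may prefer to map the mask back to coefficient names first. A rough sketch, assuming model_a from the question and that the first array returned by smt.multipletests is stored in a variable named reject:

    # model_a.pvalues is a pandas Series indexed by coefficient name,
    # so the same boolean mask picks out the surviving terms by name
    significant_terms = model_a.pvalues.index[reject]
    print(significant_terms)

    Categorical terms such as C(state) expand into one coefficient per level (names like C(state)[T.<level>]), so several mask entries can belong to a single original column, and you will need to decide how to handle columns where only some levels survive the correction.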