Tags: python, matrix, regression, beta

Python Function to Compute a Beta Matrix


I'm looking for an efficient function that automatically produces the betas for every possible multiple regression model, given a dependent variable and a set of predictors as a DataFrame in Python.

For example, given this set of data:

https://i.sstatic.net/YuPuv.jpg
The dependent variable is 'Cases per Capita' and the columns following are the predictor variables.

In a simpler example:


  Student   Grade    Hours Slept   Hours Studied   ...  
 --------- -------- ------------- --------------- ----- 
  A             90             9               1   ...  
  B             85             7               2   ...  
  C            100             4               5   ...  
  ...          ...           ...             ...   ...  

the beta matrix output would look like this:


  Regression   Hours Slept   Hours Studied  
 ------------ ------------- --------------- 
           1   #             N/A            
           2   N/A           #              
           3   #             #              

The table would have 2^n - 1 rows, where n is the number of predictor variables, so with 5 predictors and 1 dependent variable there would be 31 regressions, each using a different combination of predictors.
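To make the 2^n - 1 count concrete, here is a minimal sketch that lists every non-empty subset of a set of predictor names. The helper name `nonempty_subsets` is made up for illustration; the column names are the ones from the simpler example above.

```python
from itertools import combinations


def nonempty_subsets(names):
    """Return every non-empty combination of `names`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


predictors = ["Hours Slept", "Hours Studied"]
subsets = nonempty_subsets(predictors)

# One tuple per regression: each predictor alone, then both together
print(subsets)
# The number of subsets matches 2^n - 1
print(len(subsets) == 2 ** len(predictors) - 1)
```

Each tuple corresponds to one row of the desired beta matrix, in the same order as regressions 1 to 3 in the table above.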

The process is described in greater detail here and an actual solution that is written in R is posted here.


Solution

  • I am not aware of any package that already does this. However, you can enumerate all 2^n - 1 non-empty combinations of columns, where n is the number of columns in X (the independent variables), fit a linear regression model for each combination, and collect the coefficients (betas) from each model.

    Here is how I would do it; I hope this helps.

    from itertools import combinations

    import numpy as np
    from sklearn import datasets, linear_model

    # Test dataset. Note: load_boston was removed in scikit-learn 1.2, so this
    # line needs an older release. On newer versions, substitute another
    # regression dataset (e.g. datasets.load_diabetes); the numbers will differ.
    X, y = datasets.load_boston(return_X_y=True)

    X = X[:, :3]  # Original X has 13 columns; keep only the first n=3

    # Create all 2^n-1 (here 7, because n=3) combinations of columns,
    # where n is the number of features/independent variables
    all_combs = []
    for i in range(X.shape[1]):
        all_combs.extend(combinations(range(X.shape[1]), i + 1))

    # Print the 2^n-1 combinations
    print('2^n-1 combinations are:')
    print(all_combs)

    # Create the betas/coefficients matrix filled with NaN, with 2^n-1 rows
    # and one column per column of X
    betas = np.full((len(all_combs), X.shape[1]), np.nan)

    # Fit a model for each combination of columns and store its coefficients
    # in the matching row of betas (columns not in the model stay NaN)
    lr = linear_model.LinearRegression()
    for regression_no, comb in enumerate(all_combs):
        cols = list(comb)
        lr.fit(X[:, cols], y)
        betas[regression_no, cols] = lr.coef_

    # Print the coefficients of each model
    print('Regression No'.center(15) + " ".join(['column {}'.format(i).center(10) for i in range(X.shape[1])]))
    print('_' * 50)
    for index, beta in enumerate(betas):
        print('{}'.format(index + 1).center(15), " ".join(['{:.4f}'.format(beta[i]).center(10) for i in range(X.shape[1])]))
    

    This results in:

    2^n-1 combinations are:
    [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
    
    
        Regression No  column 0   column 1   column 2 
    __________________________________________________
           1         -0.4152      nan        nan    
           2           nan       0.1421      nan    
           3           nan        nan      -0.6485  
           4         -0.3521     0.1161      nan    
           5         -0.2455      nan      -0.5234  
           6           nan       0.0564    -0.5462  
           7         -0.2486     0.0585    -0.4156
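Since the question asks for DataFrame input, the same idea can be wrapped in a function that takes a DataFrame and returns the beta matrix as a DataFrame. This is only a sketch under a few assumptions: the name `beta_matrix` and its `target` parameter are made up for illustration, every column other than the target is a numeric predictor, an intercept is fitted but not reported, and `np.linalg.lstsq` is used in place of scikit-learn's `LinearRegression`.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def beta_matrix(df, target):
    """Fit OLS (with intercept) for every non-empty subset of predictors.

    Returns a DataFrame with one row per regression and one column per
    predictor; predictors left out of a regression are NaN.
    """
    predictors = [c for c in df.columns if c != target]
    y = df[target].to_numpy(dtype=float)
    rows = []
    for size in range(1, len(predictors) + 1):
        for combo in combinations(predictors, size):
            # Design matrix: intercept column followed by the chosen predictors
            X = np.column_stack(
                [np.ones(len(df)), df[list(combo)].to_numpy(dtype=float)]
            )
            coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
            # coefs[0] is the intercept; the rest line up with `combo`
            rows.append(dict(zip(combo, coefs[1:])))
    # Keys missing from a row's dict become NaN in that row
    out = pd.DataFrame(rows, columns=predictors)
    out.index = pd.RangeIndex(1, len(out) + 1, name="Regression")
    return out


# Only the three rows visible in the question's simpler example
grades = pd.DataFrame(
    {
        "Grade": [90, 85, 100],
        "Hours Slept": [9, 7, 4],
        "Hours Studied": [1, 2, 5],
    }
)
print(beta_matrix(grades, target="Grade"))
```

The number of regressions doubles with every added predictor, so this brute-force loop is only practical for a modest number of columns.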