Tags: python, multiprocessing, linear-regression, python-multiprocessing, statsmodels

Python statsmodels linear regression and multiprocessing Pool


I want to use multiprocessing for linear regression modelling when calculating different confidence intervals for the model.

In this example I'm using the dataset from https://www.geeksforgeeks.org/linear-regression-in-python-using-statsmodels/.

I've fitted the model and viewed the confidence interval with print(model.summary()); the default confidence interval is 95%. I know you can set it to, e.g., 99% with the alpha argument: model.summary(alpha=0.01).

I'd expect the output of my code to be a list of summaries with different confidence intervals. The problem with the code below is that every summary in the list has the same default 95% confidence interval. So clearly passing the different confidence intervals isn't working. But how do I make it work?

Thanks!

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from multiprocessing import Pool

# Data
df = pd.read_csv('C:\\Users\\Me\\Desktop\\headbrain1.csv')

# Model
df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()

# Summaries
if __name__ == "__main__":
    pool = Pool()
    summaries_list = pool.map(model.summary, [0.05, 0.04, 0.01])
    print(summaries_list)

Solution

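  • The problem is that Pool.map passes each list item to model.summary as its first positional argument, and the first parameter of summary() is yname, not alpha, so alpha keeps its default of 0.05 in every call. To get around this, wrap the call in a module-level function that passes the value as the alpha keyword. A lambda will not work here, because Pool.map has to pickle the function it sends to the workers and lambdas cannot be pickled.

    Here's one way to modify your code using a wrapper function: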

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from multiprocessing import Pool
    
    
    df = pd.read_csv('C:\\Users\\Me\\Desktop\\headbrain1.csv')
    df.columns = ['Head_size', 'Brain_weight']
    model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()
    
    # Module-level function (so it can be pickled) that passes alpha by keyword
    def get_summary(alpha):
        return model.summary(alpha=alpha)
    
    # Summaries 
    if __name__ == "__main__":
        pool = Pool()
        alphas = [0.05, 0.04, 0.01]
        summaries_list = pool.map(get_summary, alphas)
        pool.close()
        pool.join()
        for summary in summaries_list:
            print(summary)
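
To see why the original call always produced the default interval, here is a minimal sketch that needs neither statsmodels nor a process pool. `fake_summary` and `with_alpha` are hypothetical stand-ins: `fake_summary` only copies the parameter order of `summary()` (yname, xname, title, alpha), and the built-in `map` is used in place of `Pool.map` because both hand each item to the function as its first positional argument.

```python
# Hypothetical stand-in that only mimics the parameter order of
# statsmodels' summary(); it is not the real method.
def fake_summary(yname=None, xname=None, title=None, alpha=0.05):
    return {"yname": yname, "alpha": alpha}


# Wrapper that forwards the mapped value by keyword instead of by position.
def with_alpha(alpha):
    return fake_summary(alpha=alpha)


alphas = [0.05, 0.04, 0.01]

# Each item is passed as the first positional argument, so it lands in yname.
positional = list(map(fake_summary, alphas))

# With the wrapper, each item reaches the alpha parameter.
by_keyword = list(map(with_alpha, alphas))

print([r["alpha"] for r in positional])  # [0.05, 0.05, 0.05]
print([r["alpha"] for r in by_keyword])  # [0.05, 0.04, 0.01]
```

In the first list every value ended up in yname and alpha stayed at its default, which matches the behaviour described in the question.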