python pandas multi-index panel-data statistics-bootstrap

Creating a bootstrap sample by group in Python


I have a dataframe that looks something like this:

         y   X1  X2  X3
ID year
1  2010  1   2   3   4
1  2011  3   4   5   6
2  2010  1   2   3   4
2  2011  3   4   5   6
2  2012  7   8   9  10
...

I'd like to create several bootstrap samples from the original df, fit a fixed-effects panel regression on each bootstrap sample, and then store the corresponding beta coefficients. The approach I found for "normal" linear regression is the following:

import pandas as pd
import statsmodels.api as sm2  # assumed meaning of the `sm2` alias
from linearmodels.panel import PanelOLS

betas = pd.DataFrame()
for i in range(10):
    # Creating a bootstrap sample with replacement (single rows are drawn)
    bootstrap = df.sample(n=df.shape[0], replace=True)
    # Fit the regression and save beta coefficients
    DV_bs = bootstrap.y
    IV_bs = sm2.add_constant(bootstrap[['X1', 'X2', 'X3']])
    fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True).fit(cov_type='clustered', cluster_entity=True)
    b = pd.DataFrame(fe_mod_bs.params)
    print(b.head())
    # Append this replication's coefficients as a new column
    betas = pd.concat([betas, b], axis=1, join='outer')

Unfortunately, for the panel regression the bootstrap samples need to be drawn by group, so that a complete ID (with all of its years) is picked instead of a single row. I could not figure out how to extend the code above to create a sample that way. So I basically have two questions:

  1. Does the overall approach make sense for panel regression at all?
  2. How do I adjust the bootstrapping so that the multilevel / panel structure is taken into account and complete IDs instead of single rows are "picked" during the bootstrapping?
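To make the second question concrete, drawing complete IDs could be sketched roughly as below. This is only an illustration under stated assumptions: the helper name `cluster_bootstrap_sample` and its parameters are hypothetical, the frame is assumed to be indexed by (ID, year) as in the example above, and each drawn ID is given a fresh label so that an ID drawn twice does not produce duplicate index entries.

```python
import numpy as np
import pandas as pd


def cluster_bootstrap_sample(df, entity_level="ID", rng=None):
    """Draw whole entities with replacement from an (entity, time)-indexed frame.

    Hypothetical helper for illustration; not part of pandas or linearmodels.
    """
    rng = np.random.default_rng(rng)
    entities = df.index.get_level_values(entity_level).unique()
    drawn = rng.choice(entities, size=len(entities), replace=True)

    pieces = []
    for draw_id, entity in enumerate(drawn):
        # All rows of the drawn entity, however many periods it has
        block = df.xs(entity, level=entity_level, drop_level=False).reset_index()
        # Relabel the draw so the (entity, time) index stays unique
        block[entity_level] = draw_id
        pieces.append(block)

    time_level = [name for name in df.index.names if name != entity_level][0]
    return pd.concat(pieces, ignore_index=True).set_index([entity_level, time_level])


# The example data from the question
df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2, 2],
        "year": [2010, 2011, 2010, 2011, 2012],
        "y": [1, 3, 1, 3, 7],
        "X1": [2, 4, 2, 4, 8],
        "X2": [3, 5, 3, 5, 9],
        "X3": [4, 6, 4, 6, 10],
    }
).set_index(["ID", "year"])

sample = cluster_bootstrap_sample(df, rng=0)
print(sample)
```

Because whole blocks are copied with `xs`, this does not require every ID to have the same years, so it should also cope with an unbalanced panel like the one shown.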

Solution

  • I solved my problem with the following code:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm2  # assumed meaning of the `sm2` alias
    from linearmodels.panel import PanelOLS
    from tqdm import tqdm

    # One row per company; companies are the units that get resampled
    companies = pd.DataFrame(df.reset_index().Company.unique(), columns=["Company"])
    periods = np.arange(1, 25)  # every company is expected to have Periods 1..24

    betas_summary = pd.DataFrame()
    for i in tqdm(range(1, 10001)):
        # Creating a bootstrap sample of companies with replacement
        bootstrap = companies.sample(n=companies.shape[0], replace=True)
        list_of_bs_comp = bootstrap.Company.to_list()
        # All (Company, Period) pairs for the drawn companies
        bs_index = pd.MultiIndex.from_product([list_of_bs_comp, periods], names=['Company', 'Period'])
        bs_result = df.loc[bs_index, :]

        # Fit the regression and save beta coefficients
        DV_bs = bs_result.y
        IV_bs = sm2.add_constant(bs_result[['X1', 'X2', 'X3']])
        fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True).fit(cov_type='clustered', cluster_entity=True)
        b = pd.DataFrame(fe_mod_bs.params)
        b.rename(columns={'parameter': "b"}, inplace=True)
        # Append this replication's coefficients as a new column
        betas_summary = pd.concat([betas_summary, b], axis=1, join='outer')
    

    Here, Company is my entity variable and Period is my time variable (each company has periods 1 to 24).
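For readers who want to try the idea without linearmodels, here is a minimal end-to-end sketch. It is not the code from the answer: it swaps PanelOLS for a hand-rolled within (entity-demeaning) estimator in NumPy, runs on synthetic data generated purely for illustration, and the function names `within_betas` and `bootstrap_betas` are hypothetical. It stores one row of coefficients per replication and then summarises them.

```python
import numpy as np
import pandas as pd


def within_betas(panel, y_col, x_cols, entity_col):
    """Fixed-effects slopes via entity demeaning (the within estimator)."""
    cols = [y_col, *x_cols]
    demeaned = panel[cols] - panel.groupby(entity_col)[cols].transform("mean")
    coef, *_ = np.linalg.lstsq(
        demeaned[x_cols].to_numpy(), demeaned[y_col].to_numpy(), rcond=None
    )
    return pd.Series(coef, index=x_cols)


def bootstrap_betas(panel, y_col, x_cols, entity_col, n_boot=200, seed=0):
    """Resample whole entities with replacement and refit on each sample."""
    rng = np.random.default_rng(seed)
    groups = {key: block for key, block in panel.groupby(entity_col)}
    keys = list(groups)
    rows = []
    for _ in range(n_boot):
        drawn = rng.choice(len(keys), size=len(keys), replace=True)
        # Each draw gets its own entity label, so repeated draws stay distinct
        sample = pd.concat(
            [
                groups[keys[k]].assign(**{entity_col: draw_id})
                for draw_id, k in enumerate(drawn)
            ],
            ignore_index=True,
        )
        rows.append(within_betas(sample, y_col, x_cols, entity_col))
    # One row per replication, one column per regressor
    return pd.DataFrame(rows)


# Synthetic panel (illustration only): 30 companies, 6 periods each
rng = np.random.default_rng(1)
n_entities, n_periods = 30, 6
panel = pd.DataFrame(
    {
        "Company": np.repeat(np.arange(n_entities), n_periods),
        "Period": np.tile(np.arange(1, n_periods + 1), n_entities),
    }
)
panel["X1"] = rng.normal(size=len(panel))
panel["X2"] = rng.normal(size=len(panel))
effects = rng.normal(size=n_entities)[panel["Company"].to_numpy()]
panel["y"] = (
    1.5 * panel["X1"] - 0.5 * panel["X2"] + effects
    + rng.normal(scale=0.1, size=len(panel))
)

betas = bootstrap_betas(panel, "y", ["X1", "X2"], "Company", n_boot=100)
# Mean and spread of the coefficients across bootstrap replications
print(betas.agg(["mean", "std"]))
```

Collecting the per-replication results in a list and building the DataFrame once at the end avoids repeated `pd.concat` calls inside the loop, which may matter at 10,000 replications.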