Pandas Groupby - Run Self Function - Then Transform(Apply)

I need to run regression for each group, then pass the coefficients into the new column b. Here is my code:

Self-defined function:

def simplereg(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        b = model.params[x]
        return pd.Series([b*100]*len(g))
    except Exception as e:
        return pd.Series([np.NaN]*len(g))

Create sample data:

import pandas as pd
import numpy as np

# Setting the parameters
gvkeys = ['A', 'B', 'C', 'D']  # Possible values for gvkey
years = np.arange(2000, 2020)  # Possible values for year

# Number of rows for each gvkey, ensuring 5-7 observations for each
num_rows_per_gvkey = np.random.randint(5, 8, size=len(gvkeys))
total_rows = sum(num_rows_per_gvkey)

# Creating the DataFrame
np.random.seed(0)  # For reproducibility
df = pd.DataFrame({
    'gvkey': np.repeat(gvkeys, num_rows_per_gvkey),
    'year': np.random.choice(years, size=total_rows),
    'y': np.random.rand(total_rows),
    'x': np.random.rand(total_rows)
})

df.sort_values(by='year', ignore_index=True, inplace=True) # make sure if the code can handle even data without sort

Run groupby code:

df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x')

However, the code return column 'b' with all N/A May I ask where is the issue and how to fix it?

Thank you

Solution

Is it a bad idea to catch all exceptions?

Obviously yes. In addition you do not display any error message. If your function returns NaN values, it's probably because your code is throwing an exception.

Your code works well for me if I import statsmodels.api as sm and make minor changes:

import statsmodels.api as sm

def simplereg(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        b = model.params[x]
        return pd.Series([b*100]*len(g), index=g.index)  # reindex here
    except Exception as e:
        print(e)  # At least, print exception here
        return pd.Series([np.NaN]*len(g), index=g.index)  # reindex here

# drop the first group level (gvkey) to align indexes
df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x').droplevel('gvkey')

Output:

>>> df
   gvkey  year         y         x          b
0      A  2000  0.799159  0.128926 -55.326856
1      B  2001  0.774234  0.253292 -68.351309
2      A  2003  0.461479  0.315428 -55.326856
3      A  2003  0.780529  0.363711 -55.326856
4      B  2004  0.521848  0.208877 -68.351309
5      C  2005  0.612096  0.656330   6.342994
6      D  2005  0.060225  0.096098  36.320231
7      B  2006  0.414662  0.161310 -68.351309
8      C  2006  0.456150  0.466311   6.342994
9      A  2007  0.118274  0.570197 -55.326856
10     C  2007  0.568434  0.244426   6.342994
11     C  2008  0.943748  0.196582   6.342994
12     D  2009  0.681820  0.368725  36.320231
13     B  2009  0.639921  0.438602 -68.351309
14     A  2012  0.870012  0.670638 -55.326856
15     B  2012  0.264556  0.653108 -68.351309
16     C  2013  0.616934  0.138183   6.342994
17     C  2014  0.018790  0.158970   6.342994
18     A  2015  0.978618  0.210383 -55.326856
19     D  2015  0.666767  0.976459  36.320231
20     D  2016  0.437032  0.097101  36.320231
21     C  2017  0.617635  0.110375   6.342994
22     B  2018  0.944669  0.102045 -68.351309
23     B  2019  0.143353  0.988374 -68.351309
24     D  2019  0.359508  0.820993  36.320231
25     D  2019  0.697631  0.837945  36.320231