Search code examples
pythonpython-3.xpandasgroup-by

Pandas Groupby - Run Self Function - Then Transform(Apply)


I need to run regression for each group, then pass the coefficients into the new column b. Here is my code:

Self-defined function:

def simplereg(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        b = model.params[x]
        return pd.Series([b*100]*len(g))
    except Exception as e:
        return pd.Series([np.NaN]*len(g))

Create sample data:

import pandas as pd
import numpy as np

# Setting the parameters
gvkeys = ['A', 'B', 'C', 'D']  # Possible values for gvkey
years = np.arange(2000, 2020)  # Possible values for year

# Number of rows for each gvkey, ensuring 5-7 observations for each
num_rows_per_gvkey = np.random.randint(5, 8, size=len(gvkeys))
total_rows = sum(num_rows_per_gvkey)

# Creating the DataFrame
np.random.seed(0)  # For reproducibility
df = pd.DataFrame({
    'gvkey': np.repeat(gvkeys, num_rows_per_gvkey),
    'year': np.random.choice(years, size=total_rows),
    'y': np.random.rand(total_rows),
    'x': np.random.rand(total_rows)
})

df.sort_values(by='year', ignore_index=True, inplace=True) # make sure if the code can handle even data without sort

Run groupby code:

df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x')

However, the code return column 'b' with all N/A May I ask where is the issue and how to fix it?

Thank you


Solution

  • Is it a bad idea to catch all exceptions?

    Obviously yes. In addition you do not display any error message. If your function returns NaN values, it's probably because your code is throwing an exception.

    Your code works well for me if I import statsmodels.api as sm and make minor changes:

    import statsmodels.api as sm
    
    def simplereg(g, y, x):
        try:
            xvar = sm.add_constant(g[x])
            yvar = g[y]
            model = sm.OLS(yvar, xvar, missing='drop').fit()
            b = model.params[x]
            return pd.Series([b*100]*len(g), index=g.index)  # reindex here
        except Exception as e:
            print(e)  # At least, print exception here
            return pd.Series([np.NaN]*len(g), index=g.index)  # reindex here
    
    # drop the first group level (gvkey) to align indexes
    df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x').droplevel('gvkey')
    

    Output:

    >>> df
       gvkey  year         y         x          b
    0      A  2000  0.799159  0.128926 -55.326856
    1      B  2001  0.774234  0.253292 -68.351309
    2      A  2003  0.461479  0.315428 -55.326856
    3      A  2003  0.780529  0.363711 -55.326856
    4      B  2004  0.521848  0.208877 -68.351309
    5      C  2005  0.612096  0.656330   6.342994
    6      D  2005  0.060225  0.096098  36.320231
    7      B  2006  0.414662  0.161310 -68.351309
    8      C  2006  0.456150  0.466311   6.342994
    9      A  2007  0.118274  0.570197 -55.326856
    10     C  2007  0.568434  0.244426   6.342994
    11     C  2008  0.943748  0.196582   6.342994
    12     D  2009  0.681820  0.368725  36.320231
    13     B  2009  0.639921  0.438602 -68.351309
    14     A  2012  0.870012  0.670638 -55.326856
    15     B  2012  0.264556  0.653108 -68.351309
    16     C  2013  0.616934  0.138183   6.342994
    17     C  2014  0.018790  0.158970   6.342994
    18     A  2015  0.978618  0.210383 -55.326856
    19     D  2015  0.666767  0.976459  36.320231
    20     D  2016  0.437032  0.097101  36.320231
    21     C  2017  0.617635  0.110375   6.342994
    22     B  2018  0.944669  0.102045 -68.351309
    23     B  2019  0.143353  0.988374 -68.351309
    24     D  2019  0.359508  0.820993  36.320231
    25     D  2019  0.697631  0.837945  36.320231