I have a df consisting of the columns industry, comp_id, year, y, and x1, x2, x3, .... The data looks like this:
indus comp_id year y x1 x2 x3 other_var
0 A 100 2000 90.0 9 44.0 95.0 66
1 A 100 2001 59.0 65 4.0 63.0 52
2 A 100 2002 75.0 49 29.0 64.0 27
3 A 100 2003 83.0 29 86.0 48.0 57
4 A 100 2004 82.0 64 43.0 67.0 58
5 A 100 2005 88.0 30 54.0 75.0 29
6 A 100 2006 91.0 3 46.0 100.0 68
7 A 101 2005 NaN 65 53.0 100.0 57
8 A 101 2006 83.0 56 67.0 60.0 40
9 A 101 2006 112.0 100 97.0 59.0 54
10 A 101 2007 53.0 95 NaN 3.0 55
11 A 101 2007 113.0 28 77.0 96.0 30
12 A 101 2008 84.0 2 79.0 57.0 54
13 A 101 2008 53.0 19 12.0 NaN 62
14 B 102 2002 83.0 92 88.0 31.0 31
15 B 102 2003 102.0 7 67.0 96.0 28
16 B 103 2003 NaN 49 NaN 72.0 22
17 B 103 2004 NaN 60 NaN 52.0 69
18 B 103 2005 NaN 74 NaN 44.0 49
19 B 103 2006 NaN 21 NaN 31.0 38
20 B 103 2007 NaN 69 NaN 36.0 28
21 B 103 2008 NaN 5 NaN 6.0 32
22 B 103 2009 NaN 22 NaN 10.0 61
I need to run a regression for each company over the years and return a new column with the residuals. My code is:
import pandas as pd
import statsmodels.api as sm

def func_reg_err(df, yvar, xvar, alpha=True):
    y = df[yvar].copy()
    x = pd.DataFrame(df[xvar].copy())
    if alpha:
        x['intercept'] = 1.0
    mod = sm.OLS(y, x, missing='drop')
    res = mod.fit()
    err = y - mod.predict(res.params, x)
    return err
To run the regression by group, I use the following code:
df['residual'] = df.groupby('comp_id', group_keys=False).apply(
    func_reg_err, 'y', ['x1', 'x2', 'x3'], alpha=False)
If the data is clean, the code works smoothly. However, my df contains many companies with missing data (an entire column of NaN) or with too few observations (only 1 or 2). In the sample, these are the companies with comp_id = 102 and 103: one has only 2 observations and the other has too many missing values. These groups keep making my code raise errors.
I could work around this by creating a copy of df, manually filtering out the disqualified companies, running func_reg_err, and then merging the result back into df. However, that is just a workaround; I am looking for a more efficient way to handle it. For reference, the workaround I mean is roughly the sketch below.
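This sketch only illustrates the filter-and-merge idea; the helper qualifies and its thresholds (at least 3 rows, no regressor entirely NaN) are hypothetical, just to make the example concrete:

# Hypothetical qualification rule: enough rows and no regressor entirely NaN
def qualifies(g, xvar=('x1', 'x2', 'x3'), min_obs=3):
    return len(g) >= min_obs and g.loc[:, list(xvar)].notnull().any().all()

good = df.groupby('comp_id').filter(qualifies)
resid = good.groupby('comp_id', group_keys=False).apply(
    func_reg_err, 'y', ['x1', 'x2', 'x3'], alpha=False)
# Index alignment on assignment leaves NaN for the filtered-out groups
df['residual'] = resid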
What I want is for the residual column to come back as np.nan for the groups that do not qualify.
You could define a minimum acceptable sample size, either for the whole group or per column (y and each regressor), in your helper function. For instance:
def func_reg_err(df, yvar, xvar, alpha=True, min_samples=30):
    # Return NaNs only: the group is too short, or y / any regressor
    # has fewer than min_samples non-missing values
    if len(df) < min_samples or (df[[yvar] + xvar].notnull().sum() < min_samples).any():
        return pd.Series(index=df.index, dtype=float)
    # Carry on with your regression
    y = df[yvar].copy()
    x = pd.DataFrame(df[xvar].copy())
    if alpha:
        x['intercept'] = 1.0
    mod = sm.OLS(y, x, missing='drop')
    res = mod.fit()
    err = y - mod.predict(res.params, x)
    return err
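Since the groups in your sample are small, you would call it with a lower threshold; min_samples=3 here is just for illustration:

df['residual'] = df.groupby('comp_id', group_keys=False).apply(
    func_reg_err, 'y', ['x1', 'x2', 'x3'], False, min_samples=3)

With the sample above, comp_id 102 (only 2 rows) and 103 (x2 and y entirely NaN) come back as all-NaN residuals, while the other companies are fitted normally.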