I have created a dictionary of regression models, indexed by the values of group from a training dataset, d:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
d = pd.DataFrame({
    "group": ["cat", "fish", "horse", "cat", "fish", "horse", "cat", "horse"],
    "x": [1, 4, 7, 2, 5, 8, 3, 9],
    "y": [10, 20, 14, 12, 12, 3, 12, 2],
    "z": [3, 5, 3, 5, 9, 1, 2, 3]
})
features, models = ['x', 'z'], {}
for animal in ['horse', 'cat', 'fish']:
    models[animal] = Pipeline([("estimator", LinearRegression(fit_intercept=True))])
    x, y = d.loc[d.group == animal, features], d.loc[d.group == animal, "y"]
    models[animal].fit(x, y)
I also have a test dataset, test_d, which has rows for some, but not all, of the groups that have a model (and for one group, dog, that has no model at all):
test_d = pd.DataFrame({
    "group": ["dog", "fish", "horse", "dog", "fish", "horse", "dog", "horse"],
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "z": [3, 5, 3, 5, 9, 1, 2, 3]
})
I wanted to use apply on the grouped test_d, leveraging .name to look up the correct model (if it exists) and return the predictions, using a function f():
def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except:
        predictions = [None]*len(g)
    return predictions
The function "works" in the sense that it returns the correct values for each group:
grouping_column = "group"
test_d.groupby(grouping_column, group_keys=False).apply(f)
Output:
group
dog                           [None, None, None]
fish     [20.94117647058824, 12.000000000000004]
horse                          [38.0, 15.0, 8.0]
dtype: object
How should f() be written so that I can assign the values directly to test_d? I want to do something like this:
test_d["predictions"] = test_d.groupby(grouping_column, group_keys=False).apply(f)
But this doesn't work, obviously; every row ends up as NaN:
   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5          NaN
2  horse  3  3          NaN
3    dog  4  5          NaN
4   fish  5  9          NaN
5  horse  6  1          NaN
6    dog  7  2          NaN
7  horse  8  3          NaN
The output I want is:
   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5    20.941176
2  horse  3  3    38.000000
3    dog  4  5          NaN
4   fish  5  9    12.000000
5  horse  6  1    15.000000
6    dog  7  2          NaN
7  horse  8  3     8.000000
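As written, f returns one list per group, so the result of apply is indexed by group name and nothing in it lines up with the integer index of test_d. Your function f should instead return a Series with the original index: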
def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except:
        predictions = [None]*len(g)
    return pd.Series(predictions, index=g.index)
test_d.groupby('group', group_keys=False).apply(f)
Output:
0         None
3         None
6         None
1    20.941176
4         12.0
2         38.0
5         15.0
7          8.0
dtype: object
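To see why the index matters, here is a minimal sketch on a toy frame (the names df, by_group and by_row are made up for illustration and are not part of your code). Column assignment in pandas aligns on index labels, so a Series keyed by group name matches nothing in a default integer index, while a Series keyed by row label lands on the right rows whatever its order:

```python
import pandas as pd

# Toy frame with a default integer index (0, 1, 2), standing in for test_d.
df = pd.DataFrame({"group": ["dog", "fish", "dog"]})

# Keyed by group label: no label matches 0, 1 or 2, so every row becomes NaN.
by_group = pd.Series(["dog values", "fish values"], index=["dog", "fish"])
df["misaligned"] = by_group

# Keyed by the original row labels, deliberately out of order:
# each value is placed on the row whose label it carries.
by_row = pd.Series([10.0, 30.0, 20.0], index=[0, 2, 1])
df["aligned"] = by_row

print(df)
```

The first assignment reproduces the all-NaN column from the question; the second is what returning pd.Series(predictions, index=g.index) gives you.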
So if you assign, the indices will align:
test_d['predictions'] = test_d.groupby('group', group_keys=False).apply(f)
Output:
   group  x  z  predictions
0    dog  1  3         None
1   fish  2  5    20.941176
2  horse  3  3         38.0
3    dog  4  5         None
4   fish  5  9         12.0
5  horse  6  1         15.0
6    dog  7  2         None
7  horse  8  3          8.0