
How to predict on a grouped DataFrame, using a dictionary of models, and return to original test DataFrame?


I have created a dictionary of regression models, keyed by the values of group in a training dataset, d:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

d = pd.DataFrame({
    "group":["cat","fish","horse","cat","fish","horse","cat","horse"],
    "x":[1,4,7,2,5,8,3,9],
    "y":[10,20,14,12,12,3,12,2],
    "z":[3,5,3,5,9,1,2,3]
})

features = ["x", "z"]
models = {}
for animal in ["horse", "cat", "fish"]:
    # One pipeline per group, fitted only on that group's rows
    mask = d.group == animal
    models[animal] = Pipeline([("estimator", LinearRegression(fit_intercept=True))])
    models[animal].fit(d.loc[mask, features], d.loc[mask, "y"])

I also have a test dataset, test_d, whose groups only partly overlap with the models: it has rows for fish and horse, no rows for cat, and rows for dog, which has no model.

test_d = pd.DataFrame({
    "group":["dog","fish","horse","dog","fish","horse","dog","horse"],
    "x":[1,2,3,4,5,6,7,8],
    "z":[3,5,3,5,9,1,2,3]
})

I want to use apply on the grouped test_d, using each group's .name to look up the matching model (if one exists), and return the predictions from a function f():

def f(g):
    try:
        predictions = models[g.name].predict(g[features])
    except KeyError:
        # No model was trained for this group
        predictions = [None]*len(g)
    return predictions

The function "works" in the sense that it returns the correct values, but as one list per group:

grouping_column ="group"
test_d.groupby(grouping_column, group_keys=False).apply(f)

Output:

group
dog                           [None, None, None]
fish     [20.94117647058824, 12.000000000000004]
horse                          [38.0, 15.0, 8.0]
dtype: object

Question:

How should f() be written so that I can assign the values directly to test_d? I want to do something like this:

test_d["predictions"] = test_d.groupby(grouping_column, group_keys=False).apply(f)

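But this doesn't work: the result is indexed by group name rather than by test_d's row index, so nothing aligns on assignment and every row gets NaN.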

   group  x  z predictions
0    dog  1  3         NaN
1   fish  2  5         NaN
2  horse  3  3         NaN
3    dog  4  5         NaN
4   fish  5  9         NaN
5  horse  6  1         NaN
6    dog  7  2         NaN
7  horse  8  3         NaN
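
The all-NaN column is a consequence of pandas aligning on index labels during column assignment. A minimal sketch of that behaviour in isolation; the names frame, by_group and by_row are made up for illustration and are not part of the question's code:

```python
import pandas as pd

# A small frame with the default row index 0, 1, 2
frame = pd.DataFrame({"group": ["dog", "fish", "dog"]})

# Indexed by group label, like the result of the original f()
by_group = pd.Series({"dog": 1.0, "fish": 2.0})

# Indexed by row label, like the rows of the frame itself
by_row = pd.Series([1.0, 2.0, 1.0], index=[0, 1, 2])

# Assignment matches index labels, not positions
frame["misaligned"] = by_group  # no shared labels, so every row is NaN
frame["aligned"] = by_row       # labels match, so values land on their rows

print(frame)
```

Whatever f() returns therefore needs to carry the row labels of the group it was given.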

Expected Output

   group  x  z  predictions
0    dog  1  3          NaN
1   fish  2  5    20.941176
2  horse  3  3    38.000000
3    dog  4  5          NaN
4   fish  5  9    12.000000
5  horse  6  1    15.000000
6    dog  7  2          NaN
7  horse  8  3     8.000000

Solution

  • Your function f should return a Series with the original index:

    def f(g):
        try:
            predictions = models[g.name].predict(g[features])
        except KeyError:
            # No model was trained for this group
            predictions = [None]*len(g)
        return pd.Series(predictions, index=g.index)
    
    test_d.groupby('group', group_keys=False).apply(f)
    

    Output:

    0         None
    3         None
    6         None
    1    20.941176
    4         12.0
    2         38.0
    5         15.0
    7          8.0
    dtype: object
    

    This Series shares test_d's row index, so the values align when you assign it:

    test_d['predictions'] = test_d.groupby('group', group_keys=False).apply(f)
    

    Output:

       group  x  z predictions
    0    dog  1  3        None
    1   fish  2  5   20.941176
    2  horse  3  3        38.0
    3    dog  4  5        None
    4   fish  5  9        12.0
    5  horse  6  1        15.0
    6    dog  7  2        None
    7  horse  8  3         8.0
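
Note that the column above has object dtype and holds None, whereas the expected output in the question shows float NaN. One possible way to get a float column, sketched here as an alternative rather than as the answer's method, is to pre-fill the column with NaN and write each group's predictions by row label in a plain loop. The sketch uses LinearRegression directly instead of the question's Pipeline, which is an assumption made only to keep it short:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

d = pd.DataFrame({
    "group": ["cat", "fish", "horse", "cat", "fish", "horse", "cat", "horse"],
    "x": [1, 4, 7, 2, 5, 8, 3, 9],
    "y": [10, 20, 14, 12, 12, 3, 12, 2],
    "z": [3, 5, 3, 5, 9, 1, 2, 3],
})
test_d = pd.DataFrame({
    "group": ["dog", "fish", "horse", "dog", "fish", "horse", "dog", "horse"],
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "z": [3, 5, 3, 5, 9, 1, 2, 3],
})

features = ["x", "z"]

# One model per training group, keyed by group name
models = {
    animal: LinearRegression().fit(grp[features], grp["y"])
    for animal, grp in d.groupby("group")
}

# Start from a float column of NaN so groups without a model stay NaN
test_d["predictions"] = np.nan

for animal, grp in test_d.groupby("group"):
    model = models.get(animal)  # None when no model exists (e.g. "dog")
    if model is not None:
        # grp.index holds the original row labels, so values land on the right rows
        test_d.loc[grp.index, "predictions"] = model.predict(grp[features])

print(test_d)
```

Because the loop writes through the row labels directly, it does not depend on how apply combines the per-group results.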