python pandas group-by linear-regression

Linear Regression on groupby Pandas DataFrame

Currently I have my code set up like this:

def lregression(data, X, y):
    X = df['sales'].values.reshape(-1, 1)
    y = df['target']
    model = LinearRegression()
    result = model.fit(X, y)
    return model.score(X, y)

Then, I'm trying to apply this model per brand:

df.groupby('brand').apply(lregression, X, y)

But the result just gets applied to the full dataset:

Brand A    0.734
Brand B    0.734
Brand C    0.734

Am I missing something here? I want the model to run separately for each group, but instead I'm apparently getting the model applied to the full dataset and then having the overall score returned for each group. Thanks!

Solution

DATAFRAME

A minimal reproducible example is always nice to have, so I'll provide one here:

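import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

np.random.seed(42)  # fixed seed so the example is reproducible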
data = {
    'brand': np.random.choice(['Brand A', 'Brand B', 'Brand C'], size=300),
    'sales': np.random.randint(100, 1000, size=300),
    'target': np.random.randint(100, 1000, size=300)
}

df = pd.DataFrame(data)

FUNCTION

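The problem with your function is that it ignores its data argument and reads from the global df, so every group is fit on the full dataset and returns the same score. The function has to work on the group that apply passes in. It's not clear to me whether you want to return the score (namely R^2) or the coefficients of the individual regressions, but in both cases the function changes only slightly: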

Score

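def lregression(group):
    # group holds only the rows of a single brand
    X = group['sales'].values.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
    y = group['target']
    model = LinearRegression()
    result = model.fit(X, y)  # fit() returns the fitted model itself
    return result.score(X, y)  # R^2 on this group's data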

Coefficients

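def lregression(group):
    # group holds only the rows of a single brand
    X = group['sales'].values.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
    y = group['target']
    model = LinearRegression()
    result = model.fit(X, y)  # fit() returns the fitted model itself
    return result.coef_  # array with one coefficient per feature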

Then the final step (coef_ scenario):

>>> df.groupby('brand').apply(lregression)
 
brand
Brand A     [0.20322970187699263]
Brand B     [0.09134770152569331]
Brand C    [0.043343302335992005]
dtype: object

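This works as expected: the model is fit separately for each brand. Each value is a one-element array because coef_ holds one coefficient per feature, which is why the result has dtype object.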
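
If you want the score and the coefficients at once, one option is to return a pandas Series from the function, so that apply assembles one row per brand. The following is only a minimal sketch under some assumptions: fit_summary and its column names (r2, slope, intercept, n_rows) are hypothetical names chosen for illustration, and the data is the same synthetic frame as above. The sales and target columns are selected explicitly before apply, so the function never receives the brand column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the example above.
np.random.seed(42)
df = pd.DataFrame({
    'brand': np.random.choice(['Brand A', 'Brand B', 'Brand C'], size=300),
    'sales': np.random.randint(100, 1000, size=300),
    'target': np.random.randint(100, 1000, size=300),
})

def fit_summary(group):
    # Hypothetical helper: fit one regression on this group's rows only.
    X = group[['sales']]  # double brackets keep X two-dimensional
    y = group['target']
    model = LinearRegression().fit(X, y)
    # Returning a Series makes apply build one row per group.
    return pd.Series({
        'r2': model.score(X, y),
        'slope': model.coef_[0],
        'intercept': model.intercept_,
        'n_rows': len(group),
    })

# Select the modelling columns so only they are passed to fit_summary.
summary = df.groupby('brand')[['sales', 'target']].apply(fit_summary)
print(summary)
```

Because the helper returns a Series, the result is a DataFrame indexed by brand with plain numeric columns, which is easier to sort or filter than a Series of arrays.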