Currently I have my code set up like this:
def lregression(data, X, y):
    X = df['sales'].values.reshape(-1, 1)
    y = df['target']
    model = LinearRegression()
    result = model.fit(X, y)
    return model.score(X, y)
Then, I'm trying to apply this model per brand:
df.groupby('brand').apply(lregression, X, y)
But the result just gets applied to the full dataset:
Brand A 0.734
Brand B 0.734
Brand C 0.734
Am I missing something here? I want the model to run separately for each group, but instead I'm apparently getting the model applied to the full dataset and then having the overall score returned for each group. Thanks!
A minimal reproducible example is always nice to have, so I'll provide one here:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

np.random.seed(42)
data = {
    'brand': np.random.choice(['Brand A', 'Brand B', 'Brand C'], size=300),
    'sales': np.random.randint(100, 1000, size=300),
    'target': np.random.randint(100, 1000, size=300)
}
df = pd.DataFrame(data)
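The reason every brand gets the same score is that your function ignores the group passed in by apply and builds X and y from the global df instead. The function should take the group as its only argument and read its columns from it. It's not clear to me whether you want to return the score (namely R^2) or the coefficients (coef_) of the per-group regressions. In either case the function changes only slightly: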
Score
def lregression(group):
    X = group['sales'].values.reshape(-1, 1)
    y = group['target']
    model = LinearRegression()
    result = model.fit(X, y)
    return result.score(X, y)
Coefficients
def lregression(group):
    X = group['sales'].values.reshape(-1, 1)
    y = group['target']
    model = LinearRegression()
    result = model.fit(X, y)
    return result.coef_
Then the final step (coef_ scenario):
>>> df.groupby('brand').apply(lregression)
brand
Brand A [0.20322970187699263]
Brand B [0.09134770152569331]
Brand C [0.043343302335992005]
dtype: object
This works as expected.
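If you want both values at once, here is a minimal sketch that returns the R^2 and the fitted line for each brand in a single table. The function name fit_group and the labels r2, slope and intercept are names I made up for illustration, and it assumes a single predictor (sales) as in your example. Selecting the two columns before apply means each call only ever sees that brand's sales and target values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_group(group):
    """Fit target ~ sales on one brand's rows (hypothetical helper)."""
    X = group[['sales']]   # double brackets give a 2-D frame, so no reshape is needed
    y = group['target']
    model = LinearRegression().fit(X, y)
    # Returning a Series makes apply assemble one row per brand.
    return pd.Series({
        'r2': model.score(X, y),
        'slope': model.coef_[0],
        'intercept': model.intercept_,
    })


# Same synthetic data as in the reproducible example.
np.random.seed(42)
df = pd.DataFrame({
    'brand': np.random.choice(['Brand A', 'Brand B', 'Brand C'], size=300),
    'sales': np.random.randint(100, 1000, size=300),
    'target': np.random.randint(100, 1000, size=300),
})

# One independent regression per brand, using only the selected columns.
summary = df.groupby('brand')[['sales', 'target']].apply(fit_group)
print(summary)
```

The result is a DataFrame indexed by brand, so something like summary.loc['Brand A', 'r2'] gives you a single value. Since the data here is random, expect the R^2 values to be close to zero; the point is only that they differ per brand.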