
Linear Regression on groupby Pandas DataFrame


Currently I have my code set up like this:

def lregression(data, X, y):
    X = df['sales'].values.reshape(-1, 1)
    y = df['target']
    model = LinearRegression()
    result = model.fit(X, y)
    return model.score(X, y)

Then, I'm trying to apply this model per brand:

df.groupby('brand').apply(lregression, X, y)

But the result just gets applied to the full dataset:

Brand A    0.734
Brand B    0.734
Brand C    0.734

Am I missing something here? I want the model to run separately for each group, but instead I'm apparently getting the model applied to the full dataset and then having the overall score returned for each group. Thanks!


Solution

    DATAFRAME

    A minimal reproducible example is always nice to have, so I'll provide one here:

    import numpy as np
    import pandas as pd
    
    # Reproducible synthetic data: 300 rows spread over three brands
    np.random.seed(42)
    data = {
        'brand': np.random.choice(['Brand A', 'Brand B', 'Brand C'], size=300),
        'sales': np.random.randint(100, 1000, size=300),
        'target': np.random.randint(100, 1000, size=300)
    }
    
    df = pd.DataFrame(data)
    

    FUNCTION

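    The original function ignores the group that apply passes in and reads from the global df instead, so every brand gets the score of the full dataset. The fix is to build X and y from the function's own argument (group below). It's not clear to me whether you want the score (R^2) or the coefficients of the individual regressions; either way, the function changes only slightly: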

    Score

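    from sklearn.linear_model import LinearRegression
    
    def lregression(group):
        # Use the group passed in by apply, not the global df
        X = group['sales'].values.reshape(-1, 1)
        y = group['target']
        model = LinearRegression()
        result = model.fit(X, y)  # fit() returns the fitted model itself
        return result.score(X, y)  # R^2 for this group only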
    

    Coefficients

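    def lregression(group):
        # Same as above, but return the fitted slope instead of R^2
        X = group['sales'].values.reshape(-1, 1)
        y = group['target']
        model = LinearRegression()
        result = model.fit(X, y)
        return result.coef_  # array with one coefficient per feature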
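
If both the score and the coefficients are wanted at once, one option is to return a pandas Series from the function, so that apply assembles one row per brand. This is only a sketch under assumptions: the names fit_group and summary are made up for illustration, the data is synthetic, and the columns are selected before apply so the function only sees sales and target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with the same columns as the example above (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'brand': rng.choice(['Brand A', 'Brand B', 'Brand C'], size=300),
    'sales': rng.integers(100, 1000, size=300),
    'target': rng.integers(100, 1000, size=300),
})

def fit_group(group):
    """Fit target ~ sales on one group and return several statistics."""
    X = group[['sales']].to_numpy()   # 2-D feature matrix, shape (n, 1)
    y = group['target'].to_numpy()
    model = LinearRegression().fit(X, y)
    return pd.Series({
        'r2': model.score(X, y),
        'slope': model.coef_[0],
        'intercept': model.intercept_,
    })

# Returning a Series per group yields a DataFrame indexed by brand
summary = df.groupby('brand')[['sales', 'target']].apply(fit_group)
print(summary)
```

Because every group returns a Series with the same labels, the result should be a DataFrame with one row per brand and the columns r2, slope and intercept.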
    

    Then the final step (coef_ scenario):

    >>> df.groupby('brand').apply(lregression)
     
    brand
    Brand A     [0.20322970187699263]
    Brand B     [0.09134770152569331]
    Brand C    [0.043343302335992005]
    dtype: object
    

    This works as expected: each brand now gets its own coefficient.
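
To see why the version in the question returned the same number for every brand, it may help to run the two variants side by side. The sketch below uses synthetic data and hypothetical function names (score_global, score_group): the first reproduces the question's mistake of reading the global df, the second reads only the group it receives.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'brand': rng.choice(['Brand A', 'Brand B', 'Brand C'], size=300),
    'sales': rng.integers(100, 1000, size=300),
    'target': rng.integers(100, 1000, size=300),
})

def score_global(data):
    # Mistake from the question: `data` is ignored, the global `df` is used
    X = df[['sales']].to_numpy()
    y = df['target'].to_numpy()
    return LinearRegression().fit(X, y).score(X, y)

def score_group(data):
    # Reads only the rows of the group that apply passes in
    X = data[['sales']].to_numpy()
    y = data['target'].to_numpy()
    return LinearRegression().fit(X, y).score(X, y)

grouped = df.groupby('brand')[['sales', 'target']]
buggy = grouped.apply(score_global)   # same full-dataset score repeated per brand
fixed = grouped.apply(score_group)    # one score per brand

print(buggy)
print(fixed)
```

The first Series repeats a single value because the function fits the same full dataset on every call; the second varies by brand because each call sees a different subset.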