python, machine-learning, scikit-learn, linear-regression

Negative accuracy in linear regression


My linear regression model has a negative coefficient of determination R².

How can this happen? Any ideas would be helpful.

Here is my dataset:

year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0

The code for the LinearRegression model is as follows:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("data.csv", header=None)
    data = data.drop(0, axis=0)

    X = data[0]
    Y = data[1]

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, shuffle=False)

    lm = LinearRegression()
    lm.fit(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1))

    Y_pred = lm.predict(X_test.values.reshape(-1, 1))

    accuracy = lm.score(Y_test.values.reshape(-1, 1), Y_pred)
    print(accuracy)

Output:

    -3592622948027972.5

Solution

  • Scikit-learn's LinearRegression.score returns the R² score. A negative R² means the model fits your data extremely badly. Since R² compares the fit of the model against that of the null model (a horizontal straight line that always predicts the mean of y), R² is negative whenever the model fits worse than that horizontal line.

    R² = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))

    So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), then R² will be negative. Note, too, that score expects (X, y_true) as arguments; the question's call lm.score(Y_test..., Y_pred) feeds the target values in as features, which by itself produces a nonsensical score. The correct call is lm.score(X_test.values.reshape(-1, 1), Y_test).
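    To see the formula in action, here is a minimal sketch with made-up numbers (not the question's data), using sklearn.metrics.r2_score: predicting the mean of y for every sample gives R² = 0, and anything worse than that goes negative.

    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([1.0, 2.0, 3.0, 4.0])

    # A constant prediction at the mean of y scores exactly 0.
    y_mean = np.full_like(y_true, y_true.mean())
    print(r2_score(y_true, y_mean))   # 0.0

    # Predictions worse than the mean score below 0.
    y_bad = np.array([4.0, 3.0, 2.0, 1.0])
    print(r2_score(y_true, y_bad))    # -3.0

    # The same number, computed directly from the formula above.
    ss_res = np.sum((y_true - y_bad) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    print(1 - ss_res / ss_tot)        # -3.0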

    Reasons and ways to correct it:

    Problem 1: Splitting time-series data randomly ignores the temporal dimension.
    Solution: Preserve the time flow (see the code below, plus the TimeSeriesSplit sketch after the problem list).

    Problem 2: The target values are very large.
    Solution: Unless you use tree-based models, some target engineering is needed to scale the values into a range the model can learn; here a log transform also turns the roughly exponential growth into a linear trend.
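    As promised above, for time-respecting validation beyond a single chronological split, scikit-learn also provides TimeSeriesSplit. A minimal sketch (the toy array and fold count are arbitrary choices), where each fold trains only on the past and tests on the future:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Toy time-ordered feature matrix: 59 consecutive "years".
    X = np.arange(59).reshape(-1, 1)

    # Each fold's training rows all precede its test rows.
    tscv = TimeSeriesSplit(n_splits=5)
    for train_idx, test_idx in tscv.split(X):
        print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")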

    Here is a code example. Using the default parameters of LinearRegression and a log/exp transformation of the target values, my attempt yields an R² score of about 0.87:

    
    import pandas as pd
    import numpy as np

    from sklearn.linear_model import LinearRegression
    from sklearn.compose import TransformedTargetRegressor

    # We need to transform/feature-engineer the target:
    # np.log is applied to y before fitting and np.exp to the predictions,
    # which puts the values on a scale the model can learn.

    # your data is assumed to be loaded in df (see the reading snippet below)

    # re-reference the year so that it starts at 0
    df = df.assign(ref_year=lambda x: x.year - 1960)
    df.population = df.population.astype(int)

    # chronological split: first ~90% for training, last ~10% for testing
    split = int(df.shape[0] * .9)

    df = df[['ref_year', 'population']]

    train_df = df.iloc[:split]
    test_df = df.iloc[split:]

    X_train = train_df[['ref_year']]
    y_train = train_df.population

    X_test = test_df[['ref_year']]
    y_test = test_df.population

    # linear regressor wrapped with the log/exp target transformation
    regressor = LinearRegression()
    lr = TransformedTargetRegressor(
            regressor=regressor,
            func=np.log, inverse_func=np.exp)

    lr.fit(X_train, y_train)
    print(lr.score(X_test, y_test))
    
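    TransformedTargetRegressor fits the underlying model on log(population), which grows almost linearly in the year, and applies exp to the model's outputs automatically. Once fitted, lr can be used like any regressor; for example, continuing from the block above, a hypothetical extrapolation beyond the data (ref_year 59 corresponds to 2019):

    # predict population for years beyond the observed range
    future = pd.DataFrame({'ref_year': [59, 60]})
    print(lr.predict(future))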

    For those interested in reproducing or improving this, here is a way to read the dataset:

    import pandas as pd
    import io
    
    df = pd.read_csv(io.StringIO('''year,population
    1960,22151278.0 
    1961,22671191.0 
    1962,23221389.0 
    1963,23798430.0 
    1964,24397022.0 
    1965,25013626.0 
    1966,25641044.0 
    1967,26280132.0 
    1968,26944390.0 
    1969,27652709.0 
    1970,28415077.0 
    1971,29248643.0 
    1972,30140804.0 
    1973,31036662.0 
    1974,31861352.0 
    1975,32566854.0 
    1976,33128149.0 
    1977,33577242.0 
    1978,33993301.0 
    1979,34487799.0 
    1980,35141712.0 
    1981,35984528.0 
    1982,36995248.0 
    1983,38142674.0 
    1984,39374348.0 
    1985,40652141.0 
    1986,41965693.0 
    1987,43329231.0 
    1988,44757203.0 
    1989,46272299.0 
    1990,47887865.0 
    1991,49609969.0 
    1992,51423585.0 
    1993,53295566.0 
    1994,55180998.0
    1995,57047908.0 
    1996,58883530.0 
    1997,60697443.0 
    1998,62507724.0 
    1999,64343013.0 
    2000,66224804.0 
    2001,68159423.0 
    2002,70142091.0 
    2003,72170584.0 
    2004,74239505.0
    2005,76346311.0
    2006,78489206.0 
    2007,80674348.0 
    2008,82916235.0 
    2009,85233913.0 
    2010,87639964.0 
    2011,90139927.0 
    2012,92726971.0 
    2013,95385785.0 
    2014,98094253.0 
    2015,100835458.0 
    2016,103603501.0 
    2017,106400024.0 
    2018,109224559.0
    '''))
    

    Results: (image omitted)