
Can't make predictions with an OLS model


I'm building an OLS model but can't make any predictions.

Can you explain what I'm doing wrong?

Building the model:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm 
import matplotlib.pyplot as plt

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
     'Total':[100,100,200,300,10,20,40,50,60,100,500]}

d = pd.DataFrame(data=d).set_index('Client Number')

df = pd.get_dummies(d,prefix='', prefix_sep='')

X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
Y = df['Total']

X1 = sm.add_constant(X)
reg = sm.OLS(Y, X1).fit()

reg.summary()

Prediction:

d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
     'Card': ['Visa','Visa','Visa'],
     'Colateral':['Yes','Yes','No'],
     'Client Number':[11,12,13],
     'Total':[0,0,0]}

df1 = pd.DataFrame(data=d1).set_index('Client Number')

df1 = pd.get_dummies(df1,prefix='', prefix_sep='')
y_new = df1[['Lisbon','Tokyo','Visa','No','Yes']]
x_new = df1['Total']
mod = sm.OLS(y_new, x_new)

mod.predict(reg.params)

This raises: ValueError: shapes (3,1) and (11,) not aligned: 1 (dim 1) != 11 (dim 0)

What am I doing wrong?


Solution

  • Here is the corrected prediction code, with my comments:

    d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
         'Card': ['Visa','Visa','Visa'],
         'Colateral':['Yes','Yes','No'],
         'Client Number':[11,12,13],
         'Total':[0,0,0]}
    
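    df1 = pd.DataFrame(data=d1).set_index('Client Number')
    # dtype=int keeps the dummies as 0/1 (pandas >= 2.0 returns booleans by default)
    df1 = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int)
    # Predictors only: drop the placeholder target column
    x_new = df1.drop(columns='Total')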
    

    The main problem is that the training matrix X1 and the new data x_new have different numbers of dummy columns. Below, the missing dummy columns are added and filled with zeros:

    x_new = x_new.reindex(columns = X1.columns, fill_value=0)
    

    Now x_new has the same columns, in the same order, as the training dataset X1:

                   const  Lisbon  London  Madrid  ...  Master Card  Visa  No  Yes
    Client Number                                 ...                            
    11                 0       0       0       0  ...            0     1   0    1
    12                 0       0       0       0  ...            0     1   0    1
    13                 0       1       0       0  ...            0     1   1    0
    
    [3 rows x 11 columns]
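
To make the effect of reindex more visible, here is a minimal sketch that lists which columns it adds and which it drops. The column list is copied from X1 in the question; the two new clients and the city "Paris" are made up for illustration. The point is that a category never seen in training is silently discarded, so it is worth checking for before predicting.

```python
import pandas as pd

# Column order of the training design matrix X1 from the question.
train_columns = ["const", "Lisbon", "London", "Madrid", "New York", "Tokyo",
                 "Bitcoin", "Master Card", "Visa", "No", "Yes"]

# Hypothetical new clients; "Paris" never appeared in the training data.
new = pd.DataFrame({"City": ["Tokyo", "Paris"],
                    "Card": ["Visa", "Visa"],
                    "Colateral": ["Yes", "No"]})
new_dummies = pd.get_dummies(new, prefix="", prefix_sep="", dtype=int)

# Columns that reindex will add (filled with 0) and columns it will drop.
added = [c for c in train_columns if c not in new_dummies.columns]
dropped = [c for c in new_dummies.columns if c not in train_columns]

aligned = new_dummies.reindex(columns=train_columns, fill_value=0)
print("added:", added)
print("dropped:", dropped)
```

Here dropped contains only "Paris", so the second client ends up with all city dummies equal to 0.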
    

    Finally, predict on x_new with the previously fitted model reg (there is no need to build a second sm.OLS object):

    reg.predict(x_new)
    

    Result:

    Client Number
    11     35.956284
    12     35.956284
    13    135.956284
    dtype: float64
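
One caveat about these numbers: because reindex fills every missing column with 0, the const column is 0 as well (it is visible in the table printed earlier), so the intercept does not contribute to the predictions. In training, const was 1 for every row. The sketch below keeps the intercept, under the assumption that this is what is wanted; train_columns is simply X1.columns written out by hand so the snippet runs on its own.

```python
import pandas as pd

# Column order of the training design matrix X1 (const plus the dummies in X).
train_columns = ["const", "Lisbon", "London", "Madrid", "New York", "Tokyo",
                 "Bitcoin", "Master Card", "Visa", "No", "Yes"]

d1 = {"City": ["Tokyo", "Tokyo", "Lisbon"],
      "Card": ["Visa", "Visa", "Visa"],
      "Colateral": ["Yes", "Yes", "No"],
      "Client Number": [11, 12, 13]}

df1 = pd.DataFrame(data=d1).set_index("Client Number")
x_new = pd.get_dummies(df1, prefix="", prefix_sep="", dtype=int)
x_new = x_new.reindex(columns=train_columns, fill_value=0)

# Assumption: the intercept should apply to every new row, as in training.
x_new["const"] = 1.0

print(x_new[["const", "Lisbon", "Tokyo", "Visa", "No", "Yes"]])
```

Passing this x_new to reg.predict shifts each prediction by the fitted const coefficient. A quick sanity check: Lisbon appears in exactly one training row (Total = 200), and least squares fits such a row exactly, so a Lisbon/Visa/No client should come out at about 200 once the intercept is included. Calling sm.add_constant(x_new, has_constant='add') before the reindex should work as well; the default has_constant='skip' would not add anything here, because Visa is already constant in the new data.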
    

    APPENDIX

    As requested, here is fully reproducible code covering both training and prediction:

    import pandas as pd
    import statsmodels.api as sm
    
    d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','New York','Madrid','London','Tokyo','London','Tokyo'], 
         'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
         'Colateral':['Yes','Yes','No','No','Yes','No','No','Yes','Yes','No','Yes'],
         'Client Number':[1,2,3,4,5,6,7,8,9,10,11],
         'Total':[100,100,200,300,10,20,40,50,60,100,500]}
    
    d = pd.DataFrame(data=d).set_index('Client Number')
    
    # One-hot encode the categorical columns; dtype=int keeps them as 0/1
    # (pandas >= 2.0 returns booleans by default)
    df = pd.get_dummies(d, prefix='', prefix_sep='', dtype=int)
    
    X = df[['Lisbon','London','Madrid','New York','Tokyo','Bitcoin','Master Card','Visa','No','Yes']]
    Y = df['Total']
    
    # Add the intercept column and fit the model
    X1 = sm.add_constant(X)
    reg = sm.OLS(Y, X1).fit()
    
    print(reg.summary())
    
    # --- Prediction on new data ---
    d1 = {'City': ['Tokyo','Tokyo','Lisbon'], 
         'Card': ['Visa','Visa','Visa'],
         'Colateral':['Yes','Yes','No'],
         'Client Number':[11,12,13],
         'Total':[0,0,0]}
    
    df1 = pd.DataFrame(data=d1).set_index('Client Number')
    df1 = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int)
    x_new = df1.drop(columns='Total')
    
    # Add the dummy columns missing from the new data (filled with 0)
    # and match the training column order
    x_new = x_new.reindex(columns=X1.columns, fill_value=0)
    
    print(reg.predict(x_new))
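
A different route, not the one used in this answer, is the statsmodels formula interface. It builds the dummy columns from the raw categorical columns and reuses the categories seen in training at prediction time, so no manual column alignment is needed. This is only a sketch: it uses treatment coding (one reference level dropped per factor) plus an intercept, so the individual coefficients are not the same as in the full-dummy model above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Training data from the question, kept as raw categorical columns.
train = pd.DataFrame({
    "City": ["Tokyo", "Tokyo", "Lisbon", "Tokyo", "Madrid", "New York",
             "Madrid", "London", "Tokyo", "London", "Tokyo"],
    "Card": ["Visa", "Visa", "Visa", "Master Card", "Bitcoin", "Master Card",
             "Bitcoin", "Visa", "Master Card", "Visa", "Bitcoin"],
    "Colateral": ["Yes", "Yes", "No", "No", "Yes", "No",
                  "No", "Yes", "Yes", "No", "Yes"],
    "Total": [100, 100, 200, 300, 10, 20, 40, 50, 60, 100, 500],
})

# C(...) marks a column as categorical; the intercept is added automatically.
model = smf.ols("Total ~ C(City) + C(Card) + C(Colateral)", data=train).fit()

# New clients: only the predictor columns are needed, no dummy encoding.
new = pd.DataFrame({
    "City": ["Tokyo", "Tokyo", "Lisbon"],
    "Card": ["Visa", "Visa", "Visa"],
    "Colateral": ["Yes", "Yes", "No"],
})

pred = model.predict(new)
print(pred)
```

The new data only has to use category values that were present in the training data.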