
statsmodels: Questions regarding add_constant


I've been trying to get into Python and have been using some online courses (I'm working with Jupyter Notebooks, if that matters, and Python 3). In one, it was about statsmodels and regressions. As far as my statistics courses have told me, you want to include an intercept (I'm sure there are reasons not to, but afaik it's the exception).

1) I tried asking google and stumbled across an example I don't quite get: This is an example from the statsmodels site:

import statsmodels.api as sm
Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
results.params

I get what they're doing here. However, just to try some things out, I thought I'd leave out the intercept:

import statsmodels.api as sm
Y = [1,3,4,5,2,3,4]
X = range(1,8)
model = sm.OLS(Y,X)
results = model.fit()
results.params

Question 1: This returns an error: ValueError Traceback (most recent call last) <ipython-input-3-c8dfe3eb8b44> in <module>. It points at line model = sm.OLS(Y,X) for the error - why?

2a) Here's the code as it was in the course:

It's about predicting the price of a car based on multiple variables (mileage, cylinders, doors)

import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')

%matplotlib inline
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

print (X)
est = sm.OLS(y, X).fit()
est.summary()

Question 2: This seems to work, but it also returns an error: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead" - What does that mean? Is it just a heads-up from pandas to warn about potentially wrong syntax, as this discussion seems to suggest?

2b) Same code with an intercept:

import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')

%matplotlib inline
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

X = df[['Mileage', 'Cylinder', 'Doors']]
y = df['Price']
X = sm.tools.tools.add_constant(X)
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

print (X)
est = sm.OLS(y, X).fit()
est.summary()

Question 3: The coefficients don't change compared to the model without adding the constant - what am I doing wrong? Also, when executing print(X), the constant is listed as 1 observation, is that because it's basically a placeholder at that point? But why wouldn't it be 0?

Question 4: And to stay on topic of what I am not understanding: When standardization is applied with scale.fit_transform, does it matter if the constant is added before or after it?

If someone could help me with any of these questions, I'd really appreciate it.


Solution

  • I think this answer will be helpful; please let me know if anything is missing or wrong.

    Question 1 - In Python 3, range is an immutable sequence object that you can iterate over; it produces its values lazily and does not create a list.

    >>> range(1)
    range(0, 1)
    >>> type(range(1))
    <class 'range'>
    

    You can use range in a for loop, but it is not a list object. You need to build a list from the range object and pass that to OLS.

    X = list(range(1,8))
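
As a rough illustration of the conversion above, here is a minimal sketch in plain Python (no statsmodels needed). It reuses Y and X from the question and computes the least-squares slope for a single predictor with no intercept by hand, using the closed form sum(x*y) / sum(x*x). The variable names X_range and slope_no_intercept are only for this example.

```python
# Y and X are the values from the question.
Y = [1, 3, 4, 5, 2, 3, 4]
X_range = range(1, 8)   # a lazy range object, not a list
X = list(X_range)       # materialized as [1, 2, 3, 4, 5, 6, 7]

# Closed-form least-squares slope for one predictor and no intercept:
#   b = sum(x * y) / sum(x * x)
slope_no_intercept = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)

print(type(X_range).__name__, type(X).__name__)
print(slope_no_intercept)
```

The slope works out to 95/140, roughly 0.679, which is the single coefficient a no-intercept fit on this data should report.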
    

    Question 2 - This is a warning, not an error. It appears when you take a subset of a DataFrame and then assign to it: pandas cannot tell whether the assignment should modify the original DataFrame or only the subset (a copy), so the result may not be what you intended. Read more at this link:

    https://www.dataquest.io/blog/settingwithcopywarning/
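
One common way to make the intent explicit is to take a copy of the subset before assigning to it. The sketch below assumes a small made-up DataFrame in place of the cars data (the numbers are invented) and standardizes by hand with the population standard deviation instead of calling StandardScaler, so it only needs pandas.

```python
import pandas as pd

# Hypothetical stand-in for the cars data; the values are invented.
df = pd.DataFrame({
    "Mileage": [10000.0, 20000.0, 30000.0, 40000.0],
    "Cylinder": [4.0, 6.0, 8.0, 4.0],
    "Doors": [2.0, 4.0, 4.0, 2.0],
    "Price": [30000.0, 25000.0, 22000.0, 15000.0],
})

cols = ["Mileage", "Cylinder", "Doors"]

# .copy() makes X an independent DataFrame, so assigning to it is unambiguous.
X = df[cols].copy()

# Standardize each column: subtract the mean, divide by the population std.
X[cols] = (X[cols] - X[cols].mean()) / X[cols].std(ddof=0)

print(X)
print(df[cols])  # the original columns are left untouched
```

With the explicit copy, only X is changed and df keeps its original values.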

    Question 3 - The constant (intercept) indicates where the line crosses the y-axis. For example, let's say you have 3 linear functions.

    1) y = 3x + 5
    2) y = 3x - 5
    3) y = 3x + 0
    

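    In these 3 functions, the coefficient is 3 and the constants are +5, -5 and 0. That means all three functions have the same slope, but the point where each crosses the y-axis is different. The constant column is filled with 1s rather than 0s because the intercept is the coefficient multiplied by that column; a column of 0s would contribute nothing to the prediction.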
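
On why the coefficients in the question stay the same: standardized features have mean 0, and for a zero-mean predictor the least-squares slope is the same with or without an intercept, while the intercept itself comes out as the mean of y. The sketch below checks this by hand for one predictor, reusing Y and X from the question and only centering X (dividing by the standard deviation would rescale the slope in both fits equally). The variable names are illustrative, and the same argument should carry over to several centered predictors.

```python
from math import isclose

Y = [1, 3, 4, 5, 2, 3, 4]
X = list(range(1, 8))

x_mean = sum(X) / len(X)
y_mean = sum(Y) / len(Y)
xc = [x - x_mean for x in X]  # centered predictor, mean 0

# Slope without an intercept: sum(x * y) / sum(x * x)
slope_without = sum(v * y for v, y in zip(xc, Y)) / sum(v * v for v in xc)

# Slope with an intercept: covariance over variance
xc_mean = sum(xc) / len(xc)
slope_with = (
    sum((v - xc_mean) * (y - y_mean) for v, y in zip(xc, Y))
    / sum((v - xc_mean) ** 2 for v in xc)
)
intercept = y_mean - slope_with * xc_mean

print(slope_without, slope_with, intercept)
print(isclose(slope_without, slope_with))
```

Both slopes equal 0.25 here, and the intercept equals the mean of Y (22/7), so adding the constant changes the fit only through the extra intercept term.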

    Question 4 - Standardizing centers each feature at 0 and rescales it to a standard deviation of 1. If you standardize a constant array, it becomes all zeros, because the mean is equal to the constant. So I think you should add the constant after standardization.

    X_changed = (X - μ) / σ
    

    example

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    # A constant column: every value equals the mean, so (x - mean) is 0
    x = np.asarray([5]*10, dtype=np.float64)
    standardized_data = StandardScaler().fit_transform(x.reshape(-1,1))
    

    output

    array([[0.],
           [0.],
           [0.],
           [0.],
           [0.],
           [0.],
           [0.],
           [0.],
           [0.],
           [0.]])
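
To make the ordering concrete, here is a minimal NumPy sketch that standardizes the features first and then prepends a column of ones, and also shows what happens if the ones column is passed through the scaling. The standardize helper and the feature values are hypothetical; the helper treats a zero standard deviation as 1, which is an assumption chosen to reproduce the all-zeros output shown above. Note that the code in the question scales only the three named columns, so its constant column should not be affected either way.

```python
import numpy as np

def standardize(columns):
    """Center each column at 0 and scale it to unit standard deviation."""
    mean = columns.mean(axis=0)
    std = columns.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant columns unscaled
    return (columns - mean) / std

# Hypothetical feature matrix: 3 observations, 2 features.
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
ones = np.ones((len(features), 1))

# Standardize first, then add the constant: the ones column survives.
design = np.column_stack([ones, standardize(features)])

# Add the constant first, then standardize everything: the constant is wiped out.
scaled_all = standardize(np.column_stack([ones, features]))

print(design[:, 0])      # all ones
print(scaled_all[:, 0])  # all zeros
```

In the second case the first column is all zeros, so the model would no longer have an intercept.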