| TV | newspaper | radio | Sales |
|---|---|---|---|
| 87 | 13 | 23 | 123 |
| 89 | 09 | 34 | 169 |
Hi all! This is my first post, so I apologize for the formatting.
I wrote this function, and it works when I pass a single column name as `features`. I would like to be able to pass in a list of several features instead (for example, TV and radio).
    def coefficient_intercept(df,features,target):
        ybar = np.mean(df[target])
        xbar = np.mean(df[features])
        yi = df[target]
        xi = df[features]
        coefficient = sum((yi - ybar) * (xi - xbar)) / sum((xi - xbar)**2)
        intercept = ybar - coefficient * xbar
        index = [features]
        data = np.array([[coefficient],[intercept]])
        return pd.DataFrame(data = data.T, index=index,columns =['coefficient','intercept'])
I've tried a few different loops and a couple of `iloc` approaches, but they didn't get me far. Mainly, I keep running into the error below when I pass in `['TV','newspaper']`, for example, and I want to understand where the string (`'str'`) part is coming from. Thanks for your time!
    /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3430: FutureWarning: In a future version, DataFrame.mean(axis=None) will return a scalar mean over the entire DataFrame. To retain the old behavior, use 'frame.mean(axis=0)' or just 'frame.mean()'
      return mean(axis=axis, dtype=dtype, out=out, **kwargs)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-31-25f7ee028e22> in <cell line: 1>()
          1 coefficient_intercept(df,features = ['TV','newspaper'],target = 'sales')

    <ipython-input-28-5878e49e78b4> in coefficient_intercept(df, features, target)
         15     xi = df[features]
         16
         17     coefficient = sum((yi - ybar) * (xi - xbar)) / sum((xi - xbar)**2)
         18     intercept = ybar - coefficient * xbar
         19

    TypeError: unsupported operand type(s) for +: 'int' and 'str'
    def coefficient_intercept(df,features,target):
        for column in df.columns:
            if column in features:
                ybar = np.mean(df[target])
                xbar = np.mean(df[features])
                yi = df[target]
                xi = df[features]
                coefficient = sum((yi - ybar) * (xi - xbar)) / sum((xi - xbar)**2)
                intercept = ybar - coefficient * xbar
                index = [features]
                data = np.array([[coefficient],[intercept]])
                print(pd.DataFrame(data = data.T, index=index,columns =['coefficient','intercept']))
If you want to generalize to several features, work with 2D arrays throughout, even when you have only one feature:
    import numpy as np
    import pandas as pd

    def coefficient_intercept(df, features, target):
        # Wrap a single column name in a list so that df[...] is always 2D
        features = features if isinstance(features, list) else [features]
        target = [target]

        # Get the values as NumPy arrays: yi has shape (n, 1), xi has shape (n, k)
        yi = df[target].values
        xi = df[features].values

        # Column-wise means
        ybar = np.mean(yi, axis=0)
        xbar = np.mean(xi, axis=0)

        # Simple linear regression of the target on each feature separately
        coefficient = sum((yi - ybar) * (xi - xbar)) / sum((xi - xbar)**2)
        intercept = ybar - coefficient * xbar

        # Build the result DataFrame, one row per feature
        return pd.DataFrame({'coefficient': coefficient,
                             'intercept': intercept},
                            index=features)
Output:
    >>> coefficient_intercept(df, 'radio', 'Sales')
           coefficient  intercept
    radio     4.181818  26.818182

    >>> coefficient_intercept(df, ['radio', 'TV'], 'Sales')
           coefficient    intercept
    radio     4.181818    26.818182
    TV       23.000000 -1878.000000
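
As for where the `'str'` comes from, here is a minimal sketch of what I believe is happening, assuming the built-in `sum` (not `np.sum`) ends up being called on a DataFrame. Iterating over a DataFrame yields its column labels rather than its rows, and the built-in `sum` starts from the integer `0`, so it tries to add an `int` to a label such as `'TV'`. The two-row `df` below is just the sample table from the question, and the names `labels`, `error_message`, `row_sum` and `col_sum` are only for illustration.

```python
import numpy as np
import pandas as pd

# The two sample rows from the question's table
df = pd.DataFrame({"TV": [87, 89], "newspaper": [13, 9],
                   "radio": [23, 34], "Sales": [123, 169]})

xi = df[["TV", "newspaper"]]

# Iterating over a DataFrame yields its column labels, not its rows
labels = list(xi)

# The built-in sum() starts from the int 0 and adds each item in turn,
# so here it attempts 0 + 'TV' and raises a TypeError
try:
    sum(xi)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Iterating over a 2D NumPy array yields rows, so sum() adds them element-wise
row_sum = sum(xi.values)

# The pandas method gives the same per-column totals without the built-in sum()
col_sum = xi.sum(axis=0)

print(labels)
print(error_message)
print(row_sum, col_sum.tolist())
```

This is also why the function above works once `.values` is used: the built-in `sum` then adds up the rows of a NumPy array instead of walking over column labels.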