I'm working with Python 3 in a Jupyter notebook. I am trying to create a regression model (equation?) to predict the Eng as % of Followers variable. I'd be given Media Type, Hour Created, and Day of Week. These should all be treated as categorical variables.
Here is some of the past data I have.
  Media Type  Eng as % of Followers  Hour Created Day of Week
0      Video                 0.0136            23     Tuesday
1      Video                 0.0163            22   Wednesday
2      Video                 0.0163            22     Tuesday
3      Video                 0.0196            22      Friday
4      Video                 0.0179            20    Thursday
5      Photo                 0.0087            14   Wednesday
I've created dummy variables using pd.get_dummies, but I'm not sure I did that correctly. The problem specifically lies with the Hour Created variable. The values are numbers, but I want them treated as categories. For example, Hour 22 might be a performance booster, but that shouldn't imply anything about Hours 21 or 23.
I'm also curious if I could have my model factor in the interaction between Day of Week and Hour Created (maybe Hour 22 is a boost on most days, but 22-Friday causes a dip) like I've seen done with patsy... but that might be me getting greedy.
Here is how I created my dummy variables, which sets me up for the issue of having Hour Created as a quantitative variable instead of a qualitative one. Also, the Vars dataframe that I'd use going forward now doesn't have the very thing that I'm trying to predict. Could that possibly be right?
Vars = Training[['Hour Created','Day of Week','Media Type']]
Result = Training['Eng as % of Followers']
Vars = pd.get_dummies(data=Vars, drop_first=True)
If someone could help with the Hour Created problem, that would be a great start.... And then, not sure where to go from there. I've seen people use the ols function in this situation. Or linear_model from sklearn. I'm struggling with how to interpret the results from either, and especially struggling with how I'd plug a dataframe of those 3 independent variables into that model. If someone can make a suggestion, I'll try to run with it.
Edit: Including a couple of ways I tried to create this model. Here's the first, which I assume is using my Hour data incorrectly. And being that the dataframe I'm passing into it doesn't even have Eng as % of Followers as a column header, I'm not even sure what it's trying to predict...
Vars_train, Vars_test, Result_train, Result_test = train_test_split(Vars, Result, test_size = .20, random_state = 40)
regr = linear_model.LinearRegression()
regr.fit(Vars_train, Result_train)
predicted = regr.predict(Vars_test)
When I try to use the ols method as follows, I get an invalid syntax error. I've tried different variations to no avail.
fit1 = ols('Eng as % of Followers ~ C(Day of Week) + C(Hour Created) + C(Media Type)', data=Training).fit()
One way to make sure that you are doing dummy coding correctly is to convert the columns to str type. In your case you want to treat Hour Created as categorical even though it is numeric, so convert it to strings before dummy coding. Otherwise pd.get_dummies leaves the numeric column untouched.
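A small sketch of the difference, using the six rows from the question. The variable names (as_numbers, as_categories, via_columns) are only illustrative. It also shows the columns argument of pd.get_dummies, which is an alternative to the string conversion and forces encoding regardless of dtype.

```python
import pandas as pd

# The six rows from the question (only the two relevant columns)
df = pd.DataFrame({
    "Hour Created": [23, 22, 22, 22, 20, 14],
    "Day of Week": ["Tuesday", "Wednesday", "Tuesday",
                    "Friday", "Thursday", "Wednesday"],
})

# Integer column: get_dummies passes it through as a single numeric column
as_numbers = pd.get_dummies(df)

# String column: get_dummies creates one indicator column per hour
df_str = df.copy()
df_str["Hour Created"] = df_str["Hour Created"].astype(str)
as_categories = pd.get_dummies(df_str)

# Alternative: name the columns to encode explicitly, whatever their dtype
via_columns = pd.get_dummies(df, columns=["Hour Created", "Day of Week"])

print(list(as_numbers.columns))
print(list(as_categories.columns))
print(list(via_columns.columns))
```

In the first result Hour Created survives as one numeric column, which is what makes the model treat hours as a quantity.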
To capture the interaction between Day of Week and Hour Created, do some feature engineering: create your own feature by combining the two columns (for example, concatenating them into a single string such as 22_Friday) and feed it to your model as another categorical input.
To understand and interpret your model, look at the weights (coefficients) of the different features. They give an idea of whether each feature pushes your target variable up or down, and by how much.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df
   Media   Type  Eng_as_%_of_Followers Hour_Created Day_of_Week
0      0  Video                 0.0136           23     Tuesday
1      1  Video                 0.0163           22   Wednesday
2      2  Video                 0.0163           22     Tuesday
3      3  Video                 0.0196           22      Friday
4      4  Video                 0.0179           20    Thursday
5      5  Photo                 0.0087           14   Wednesday
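# Treat the hour as a category, not a number
df["Hour_Created"] = df["Hour_Created"].astype(str)
# Hour/day interaction as one combined category, e.g. "22_Friday"
df["Interaction"] = df["Hour_Created"] + "_" + df["Day_of_Week"]
# Separate the predictors from the target
X = df.drop("Eng_as_%_of_Followers", axis=1)
Y = df["Eng_as_%_of_Followers"]
# One-hot encode every string column
X_encoded = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, Y, test_size=0.33, random_state=42)
reg = LinearRegression().fit(X_train, y_train)
# Map each encoded column to its fitted coefficient
coef_dict = dict(zip(X_encoded.columns, reg.coef_))
coef_dict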
{'Day_of_Week_Friday': 0.0012837455830388678,
'Day_of_Week_Thursday': 0.0007424028268551229,
'Day_of_Week_Tuesday': -0.0008084805653710235,
'Day_of_Week_Wednesday': -0.0012176678445229678,
'Hour_Created_14': -0.0012176678445229678,
'Hour_Created_20': 0.0007424028268551229,
'Hour_Created_22': 0.0004752650176678456,
'Hour_Created_23': 0.0,
'Interaction_14_Wednesday': -0.0012176678445229678,
'Interaction_20_Thursday': 0.0007424028268551229,
'Interaction_22_Friday': 0.0012837455830388678,
'Interaction_22_Tuesday': -0.0008084805653710235,
'Interaction_22_Wednesday': 0.0,
'Interaction_23_Tuesday': 0.0,
'Media': -0.0008844522968197866,
'Type_Photo': -0.0012176678445229708,
'Type_Video': 0.0012176678445229685}
Of course, the results are not very meaningful here because I was working with only 6 data points.
Answering your questions:
You can find the y-intercept using reg.intercept_.
Yes, you can plug in new values for x and get your target variable by using reg.predict(x), where x is your new input, encoded with the same dummy columns as the training data.
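New rows rarely contain every category seen in training, so their dummy columns need to be lined up with the training columns before calling predict. A minimal sketch of one way to do that with reindex, assuming the six rows from the question and fitting on all of them without a train/test split. The names train, new_posts and new_encoded are made up for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The six rows from the question, with the hour already stored as a string
train = pd.DataFrame({
    "Media_Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Hour_Created": ["23", "22", "22", "22", "20", "14"],
    "Day_of_Week": ["Tuesday", "Wednesday", "Tuesday",
                    "Friday", "Thursday", "Wednesday"],
    "Eng_as_%_of_Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
})

X_encoded = pd.get_dummies(train.drop("Eng_as_%_of_Followers", axis=1))
reg = LinearRegression().fit(X_encoded, train["Eng_as_%_of_Followers"])

# A hypothetical new post to score
new_posts = pd.DataFrame({
    "Media_Type": ["Video"],
    "Hour_Created": ["22"],
    "Day_of_Week": ["Friday"],
})

# Encode, then force the same columns in the same order as in training;
# categories missing from the new data become columns of zeros
new_encoded = pd.get_dummies(new_posts).reindex(
    columns=X_encoded.columns, fill_value=0)

prediction = reg.predict(new_encoded)
print(reg.intercept_, prediction)
```

Any category that appears only in the new data is dropped by the reindex, so the model silently treats it as "none of the known categories".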
Regression done with the ols function and with sklearn's LinearRegression is one and the same. Both fit the model by ordinary least squares, which is simply a way of solving the optimization problem posed by linear regression.
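As for the invalid syntax error from ols: the formula is parsed like Python code, and column names containing spaces or % are not valid Python names. A minimal sketch, assuming statsmodels is installed, that quotes each name with patsy's Q() helper. The Training frame below is rebuilt from the six rows in the question, and new_posts is a made-up example row.

```python
import pandas as pd
from statsmodels.formula.api import ols

# The six rows from the question, with the original column names
Training = pd.DataFrame({
    "Media Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Eng as % of Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
    "Hour Created": [23, 22, 22, 22, 20, 14],
    "Day of Week": ["Tuesday", "Wednesday", "Tuesday",
                    "Friday", "Thursday", "Wednesday"],
})

# Q("...") quotes a column name that is not a valid Python identifier;
# C(...) tells patsy to treat the column as categorical, so the numeric
# hour needs no manual dummy coding here.
formula = ('Q("Eng as % of Followers") ~ C(Q("Day of Week")) '
           '+ C(Q("Hour Created")) + C(Q("Media Type"))')
# An interaction could be added with patsy's ":" operator, e.g.
#   + C(Q("Day of Week")):C(Q("Hour Created"))
fit1 = ols(formula, data=Training).fit()

# The fitted model takes a plain dataframe of the three predictors
new_posts = pd.DataFrame({
    "Media Type": ["Video"],
    "Hour Created": [22],
    "Day of Week": ["Tuesday"],
})
predicted = fit1.predict(new_posts)
print(fit1.params)
print(predicted)
```

With only six rows there are more coefficients than the data can pin down, so the individual values are not meaningful here. Renaming the columns to names without spaces would avoid the need for Q() altogether.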