python regression categorical-data dummy-variable

Creating a regression model using Day of Week, Hour of Day, and Type of Media?

Working with Python 3 in a Jupyter notebook. I am trying to create a regression model (equation?) to predict the Eng as % of Followers variable. I'd be given Media Type, Hour Created, and Day of Week. These should all be treated as categorical variables.

Here is some of the past data I have.

    Media Type  Eng as % of Followers  Hour Created  Day of Week
0   Video       0.0136                 23            Tuesday
1   Video       0.0163                 22            Wednesday
2   Video       0.0163                 22            Tuesday
3   Video       0.0196                 22            Friday
4   Video       0.0179                 20            Thursday
5   Photo       0.0087                 14            Wednesday

I've created dummy variables using pd.get_dummies, but I'm not sure I did that correctly - the problem specifically lies with the Hour Created variable. They're numbers, but I want them treated as categories. For example, Hour 22 might be a performance booster, but that shouldn't imply anything about Hours 21 or 23.

I'm also curious if I could have my model factor in the interaction between Day of Week and Hour Created (maybe Hour 22 is a boost on most days, but 22-Friday causes a dip) like I've seen done with patsy... but that might be me getting greedy.

Here is how I created my dummy variables, which sets me up for the issue of having Hour Created as a quantitative variable, instead of qualitative. Also, the Vars dataframe that I'd use going forward now doesn't have the very thing that I'm trying to predict. Could that possibly be right?

Vars = Training[['Hour Created','Day of Week','Media Type']]
Result = Training['Eng as % of Followers']
Vars = pd.get_dummies(data=Vars, drop_first=True)

If someone could help with the Hour Created problem, that would be a great start.... And then, not sure where to go from there. I've seen people use the ols function in this situation. Or linear_model from sklearn. I'm struggling with how to interpret the results from either, and especially struggling with how I'd plug a dataframe of those 3 independent variables into that model. If someone can make a suggestion, I'll try to run with it.

Edit: Including a couple of ways I tried to create this model. Here's the first, which I assume is using my Hour data incorrectly. And being that the dataframe I'm passing into it doesn't even have Eng as % of Followers as a column header, I'm not even sure what it's trying to predict...

Vars_train, Vars_test, Result_train, Result_test = train_test_split(Vars, Result, test_size = .20, random_state = 40)
regr = linear_model.LinearRegression() 
regr.fit(Vars_train, Result_train)
predicted = regr.predict(Vars_test)

When I try to use the ols method as follows, I get an invalid syntax error. I've tried different variations to no avail.

fit1 = ols('Eng as % of Followers ~ C(Day of Week) + C(Hour Created) + C(Media Type)', data=Training).fit()

Solution

One way to make sure the dummy coding is done correctly is to convert the columns to str first. By default pd.get_dummies only expands object, string, and category columns, so a numeric column such as Hour Created is passed through as a single quantitative feature. Since you want Hour Created treated as categorical even though it is numeric in nature, convert it to strings before dummy coding.
To capture the interaction between Day of Week and Hour Created, do some feature engineering: build your own feature by combining the two values into a single label (for example "22_Friday") and feed it to your model as another categorical input.
To understand/interpret your model, look at the weights/coefficients of the features. The sign and size of each coefficient give an idea of whether that feature pushes your target variable up or down, and by how much.
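To see the first point in action, here is a minimal sketch using the six rows from the question. The underscore column names are my own choice for convenience, and the variable names (numeric_hour, string_hour, explicit) are just for illustration.

```python
import pandas as pd

# The six sample rows from the question (column names use underscores here).
df = pd.DataFrame({
    "Media_Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Hour_Created": [23, 22, 22, 22, 20, 14],
    "Day_of_Week": ["Tuesday", "Wednesday", "Tuesday",
                    "Friday", "Thursday", "Wednesday"],
})

# Integer Hour_Created is left alone: it stays one numeric column.
numeric_hour = pd.get_dummies(df)

# After casting to str, each hour becomes its own indicator column.
string_hour = pd.get_dummies(
    df.assign(Hour_Created=df["Hour_Created"].astype(str))
)

# Alternative: keep the dtype and name the columns to encode explicitly.
explicit = pd.get_dummies(
    df, columns=["Hour_Created", "Day_of_Week", "Media_Type"]
)

print(list(numeric_hour.columns))
print(list(string_hour.columns))
```

Either the str cast or the explicit columns= argument should give you one indicator column per hour, so Hour 22 says nothing about Hours 21 or 23.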

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df 

   Media   Type  Eng_as_%_of_Followers  Hour_Created  Day_of_Week
0      0  Video                 0.0136            23      Tuesday
1      1  Video                 0.0163            22    Wednesday
2      2  Video                 0.0163            22      Tuesday
3      3  Video                 0.0196            22       Friday
4      4  Video                 0.0179            20     Thursday
5      5  Photo                 0.0087            14    Wednesday

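# Cast the hour to str so get_dummies treats it as a category, not a number
df["Hour_Created"] = df["Hour_Created"].astype(str)
# Hand-made interaction feature: one label per hour/day pair, e.g. "22_Friday"
df["Interaction"] = df["Hour_Created"] + "_" + df["Day_of_Week"]

# Features (X) and target (Y) are kept separate; the target is never part of X
X = df.drop("Eng_as_%_of_Followers", axis=1)
Y = df["Eng_as_%_of_Followers"]

# One-hot encode every string column
X_encoded = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, Y, test_size=0.33, random_state=42)

reg = LinearRegression().fit(X_train, y_train)

# Map each feature name to its fitted coefficient
coef_dict = dict(zip(X_encoded.columns, reg.coef_))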

coef_dict

{'Day_of_Week_Friday': 0.0012837455830388678,
 'Day_of_Week_Thursday': 0.0007424028268551229,
 'Day_of_Week_Tuesday': -0.0008084805653710235,
 'Day_of_Week_Wednesday': -0.0012176678445229678,
 'Hour_Created_14': -0.0012176678445229678,
 'Hour_Created_20': 0.0007424028268551229,
 'Hour_Created_22': 0.0004752650176678456,
 'Hour_Created_23': 0.0,
 'Interaction_14_Wednesday': -0.0012176678445229678,
 'Interaction_20_Thursday': 0.0007424028268551229,
 'Interaction_22_Friday': 0.0012837455830388678,
 'Interaction_22_Tuesday': -0.0008084805653710235,
 'Interaction_22_Wednesday': 0.0,
 'Interaction_23_Tuesday': 0.0,
 'Media': -0.0008844522968197866,
 'Type_Photo': -0.0012176678445229708,
 'Type_Video': 0.0012176678445229685}

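Of course, the results are not very meaningful here, because I was working with only 6 data points. Also note that the Media column is just a copy of the row index that came along when I loaded the sample (the Media Type header got split in two), so its coefficient means nothing; with real data, leave it out of X.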
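To plug new posts into a model like this, the new rows have to go through the same encoding and end up with the same dummy columns as the training data. Here is a rough sketch of one way to do that; the encode helper and the new_posts frame are hypothetical names I made up for illustration, and the training rows are the six from the question.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.DataFrame({
    "Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Hour_Created": [23, 22, 22, 22, 20, 14],
    "Day_of_Week": ["Tuesday", "Wednesday", "Tuesday",
                    "Friday", "Thursday", "Wednesday"],
    "Eng_as_%_of_Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
})

def encode(frame):
    """Hypothetical helper: treat the hour as a category, then one-hot encode."""
    frame = frame.copy()
    frame["Hour_Created"] = frame["Hour_Created"].astype(str)
    return pd.get_dummies(frame, dtype=float)

X_train = encode(train.drop(columns="Eng_as_%_of_Followers"))
y_train = train["Eng_as_%_of_Followers"]
reg = LinearRegression().fit(X_train, y_train)

# A new post to score (made-up input, not data from the question).
new_posts = pd.DataFrame({
    "Type": ["Video"],
    "Hour_Created": [22],
    "Day_of_Week": ["Friday"],
})

# Line the new dummies up with the training columns; categories that are
# missing from the new rows are filled with 0.
X_new = encode(new_posts).reindex(columns=X_train.columns, fill_value=0.0)

predicted = reg.predict(X_new)
print(predicted)
```

Keep in mind that reindex silently drops any category the model never saw during training (say, Hour 3), so predictions for unseen hours or days fall back to the baseline rather than raising an error.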

Answering your questions

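You can get the intercept from reg.intercept_.
Yes, you can plug in new values for x and get your target variable with reg.predict(x), where x is your new input encoded with the same dummy columns, in the same order, that the model was trained on.
The regression done by statsmodels' ols and by sklearn's LinearRegression is one and the same model. OLS (ordinary least squares) is just the method used to solve the optimization problem behind linear regression. The main practical difference is that statsmodels also gives you a summary with standard errors and p-values, which can help with interpretation.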
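As for the invalid syntax error from ols: my guess is that it comes from the spaces (and the % sign) in the column names, since the formula is parsed like Python code. A sketch of one way around it, assuming statsmodels with its patsy formula interface, is to wrap each column name in Q("..."). The Training frame below just rebuilds the six rows from the question.

```python
import pandas as pd
from statsmodels.formula.api import ols

Training = pd.DataFrame({
    "Media Type": ["Video", "Video", "Video", "Video", "Video", "Photo"],
    "Eng as % of Followers": [0.0136, 0.0163, 0.0163, 0.0196, 0.0179, 0.0087],
    "Hour Created": [23, 22, 22, 22, 20, 14],
    "Day of Week": ["Tuesday", "Wednesday", "Tuesday",
                    "Friday", "Thursday", "Wednesday"],
})

# Q("...") quotes column names that are not valid Python identifiers.
# C(...) marks a column as categorical even when it holds numbers,
# so Hour Created gets one dummy per hour without any manual encoding.
formula = (
    'Q("Eng as % of Followers") ~ '
    'C(Q("Day of Week")) + C(Q("Hour Created")) + C(Q("Media Type"))'
)

fit1 = ols(formula, data=Training).fit()
print(fit1.params)

# With enough data, an hour-by-day interaction could be added as
#   + C(Q("Day of Week")):C(Q("Hour Created"))
```

With this route the target stays in the same dataframe and the formula says which column is being predicted. Renaming the columns to plain identifiers (for example Hour_Created) before fitting would avoid the Q() wrappers altogether.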