I am fitting a GLM with a number of variables, and then using the fitted model to predict new values.
I have noticed that after manually changing the GLM coefficient for one level of a categorical variable, I still get the same predicted values, even though I know some of my data has this level. Some code might help explain my process:
##data frame
df <-data.frame(Account =c("A","B","C","D","E","F","G","H"),
Exposure = c(1,50,67,85,250,25,22,89),
JudicialOrientation=c("Neutral","Neutral","Plaintiff","Defense","Plaintiff","Neutral","Plaintiff","Defense"),
Freq= c(.008,.5,.05,.34,.7,0,.04,.12),
Losses = c(100000,100,2500,100000,25000,0,7500,5200),
LossPerUnit = c(100000,100,2500,100000,25000,0,7500,5200)/c(1,50,67,85,250,25,22,89))
##Variables for modeling
ModelingVars <- as.formula(df$LossPerUnit~df$JudicialOrientation+df$Freq)
##Tweedie GLM (the tweedie() family comes from the statmod package)
library(statmod)
Model <- glm(ModelingVars, family=tweedie(var.power=1.5, link.power = 0),
weights = Exposure, data = df)
summary(Model)
##Predict Losses with Model coefficients
df$PredictedLossPerUnit <- predict(Model,df, type="response")
##Manually edit a coefficient for one of my categorical variable's levels
Model$coefficients["df$JudicialOrientationNeutral"] <-log(50)
##Predict Losses again to compare
df$PredictedLossPerUnit2 <- predict(Model, df, type ="response")
sum(df$PredictedLossPerUnit)
sum(df$PredictedLossPerUnit2)
View(head(df))
summary(Model)
This code works fine: the two PredictedLossPerUnit columns differ for every row where JudicialOrientation is "Neutral". When I do something similar on my main data set, which has more variables of the same kind (some continuous, some discrete with multiple bins), I keep getting the same predicted values from predict even after I manipulate a coefficient.
Is there anything strange that would cause predict to keep giving the same results as the original fit, even after I manually changed a coefficient in my GLM?
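EDIT: I found the answer. In my other data set I was doing: df$PredictedLossPerUnit <- predict(Model,data=df, type="response")
data isn't actually an argument of the predict function; it should have been "newdata". Because the unrecognized argument is silently ignored, predict fell back to the fitted values stored when the model was first fit, which is why the edited coefficient had no effect. A silly mistake but a good lesson. Thanks to all who helped.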
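For anyone who lands here with the same symptom, here is a minimal sketch of the difference. It assumes a made-up toy data frame (the names toy, y, g and x are hypothetical) and uses base R's Gamma family rather than the Tweedie family, only so that no extra package is needed. Since predict.glm accepts extra arguments through `...`, `data = toy` raises no error, and with no `newdata` the stored fitted values come back:

```r
# Hypothetical toy data, not the data from the question
toy <- data.frame(
  y = c(12, 15, 30, 28, 9, 11, 33, 14),
  g = factor(c("a", "a", "b", "b", "a", "a", "b", "a")),
  x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.7, 0.8)
)

fit <- glm(y ~ g + x, family = Gamma(link = "log"), data = toy)

# Predictions from the untouched model
before <- predict(fit, newdata = toy, type = "response")

# Manually overwrite the coefficient for level "b" of g
fit$coefficients["gb"] <- log(50)

# 'data' is not an argument of predict.glm: it is swallowed by '...',
# so this returns the fitted values stored at fit time
wrong <- predict(fit, data = toy, type = "response")

# 'newdata' forces the predictions to be recomputed from the coefficients
right <- predict(fit, newdata = toy, type = "response")

isTRUE(all.equal(unname(wrong), unname(before)))  # the edit is invisible here
cbind(before, wrong, right)                       # rows with g == "b" differ in 'right'
```

Only the rows with g equal to "b" should change in the `right` column, which mirrors what happens to the "Neutral" rows in the example above.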
You are using the formula in a manner that detaches its variables from the df object, or at least confuses the logic of predict.lm. If you instead create the formula the way it was intended to be used (with column names only, and no reference to the data object's name), you get the desired effect:
ModelingVars <- as.formula(LossPerUnit~JudicialOrientation+Freq)
#----------
> df$PredictedLossPerUnit <- predict(Model,df, type="response")
>
>
> ##Manually edit a coefficient for one of my categorical variable's levels
> Model$coefficients["JudicialOrientationNeutral"] <-log(50)
>
> ##Predict Losses again to compare
> df$PredictedLossPerUnit2 <- predict(Model, df, type ="response")
>
> df
  Account Exposure JudicialOrientation  Freq Losses  LossPerUnit PredictedLossPerUnit PredictedLossPerUnit2
1       A        1             Neutral 0.008 100000 100000.00000           1549.56677           40213.38196
2       B       50             Neutral 0.500    100      2.00000            919.41825           23860.16405
3       C       67           Plaintiff 0.050   2500     37.31343            169.99221             169.99221
4       D       85             Defense 0.340 100000   1176.47059            565.49150             565.49150
5       E      250           Plaintiff 0.700  25000    100.00000             85.29641              85.29641
6       F       25             Neutral 0.000      0      0.00000           1562.77490           40556.15105
7       G       22           Plaintiff 0.040   7500    340.90909            171.80535             171.80535
8       H       89             Defense 0.120   5200     58.42697            714.15870             714.15870
I usually try to keep essential material on screen but here you will need to scroll over to see that the "Neutral" items in the two columns are different.
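To illustrate why a formula written with the data object's name is fragile, here is a small sketch under stated assumptions: the data frames train and new_rows are made up for illustration, and the Gamma family from base R stands in for the Tweedie family so the snippet runs without extra packages. With train$x in the formula, predict looks the variable up in the original object rather than in newdata, so the new rows are effectively ignored:

```r
# Hypothetical training data and genuinely new rows
train <- data.frame(
  y = c(12, 15, 30, 28, 9, 11, 33, 14),
  x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.7, 0.8)
)
new_rows <- data.frame(x = c(0.25, 0.75))

# Formula that hard-codes the data object's name
fit_dollar <- glm(train$y ~ train$x, family = Gamma(link = "log"))

# Formula that uses column names only, with the data passed separately
fit_names <- glm(y ~ x, family = Gamma(link = "log"), data = train)

# 'train$x' is re-evaluated from 'train', not taken from new_rows;
# R warns that 'newdata' had 2 rows but the variables found have 8 rows
p_dollar <- predict(fit_dollar, newdata = new_rows, type = "response")

# Column names are looked up in new_rows, as intended
p_names <- predict(fit_names, newdata = new_rows, type = "response")

length(p_dollar)  # one prediction per training row
length(p_names)   # one prediction per new row
```

When newdata happens to be the same data frame that was used for fitting, both forms return the same numbers, so the problem can stay hidden until the model is applied to different data.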
Edit: I left the creation of the formula outside the glm call since that was the smallest possible change, but a better strategy would be to pass your formula directly, without the "as.formula" wrapper, which shouldn't be needed and may carry a different environment for later evaluation. First run: Model <- glm(LossPerUnit~JudicialOrientation+Freq, family = tweedie(var.power=1.5, link.power = 0), weights = Exposure, data = df) and then do your coefficient violence.
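One related precaution, sketched here as a suggestion rather than as part of the original fix: once the formula changes, the coefficient names change too ("df$JudicialOrientationNeutral" becomes "JudicialOrientationNeutral"), and assigning to a name that does not exist appends a new element to the coefficient vector instead of raising an error. The helper set_coef below is hypothetical, as are the toy data, and the Gamma family again stands in for the Tweedie family:

```r
# Hypothetical toy data
toy <- data.frame(
  y = c(12, 15, 30, 28, 9, 11, 33, 14),
  g = factor(c("a", "a", "b", "b", "a", "a", "b", "a")),
  x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.7, 0.8)
)
fit <- glm(y ~ g + x, family = Gamma(link = "log"), data = toy)

# Inspect the exact names before editing anything
names(coef(fit))

# Assigning to a misspelled name does not fail; it grows the vector
tmp <- coef(fit)
tmp["toy$gb"] <- log(50)
length(tmp) == length(coef(fit)) + 1

# Hypothetical helper: refuse to edit a coefficient that does not exist
set_coef <- function(model, name, value) {
  if (!name %in% names(coef(model))) {
    stop("no coefficient named '", name, "'; available: ",
         paste(names(coef(model)), collapse = ", "))
  }
  model$coefficients[name] <- value
  model
}

fit2 <- set_coef(fit, "gb", log(50))
predict(fit2, newdata = toy, type = "response")
```

Calling set_coef(fit, "toy$gb", log(50)) stops with an error that lists the valid names, which makes a stale name easier to spot than a prediction column that never changes.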