
Linear model with boxcox for data frame with zeros. Unable to predict for required values


I am trying to use boxcox to normalise my data, but the model I generate can't predict at the conditions I want. Why is this happening?

I have a dataframe:

    a<-data.frame(Output=c(0.065,8.00,2.320,0.128,42.500,35.200,18.200,2.94,1.68,13.90,43.50,3.810,2.600),
                  Carbon=c(20.0,22.5,10.0,7.0,35.0,20.,35.0,2.0,10.0,25.0,30.0,10.0,8.0),               
                  Cooling=c(0.0,50.0,12.0,0.0,12.70,12.70,5.0,2.0,0.00,0.00,12.70,10.00,14.69),
                  Drying=c(0.0,70.00,0.00,0.00,0.90,0.90,0.90,55.80,0.00,0.00,0.90,15.00,35.56))

Using the following libraries:

    library(MASS)

I ran the following code:

    bc<-boxcox(a$Output~a$Cooling*a$Drying+a$Carbon)
    lambda<-bc$x[which.max(bc$y)]
    new.model<-lm(((a$Output^lambda-1)/lambda)~a$Drying*a$Cooling+a$Carbon)

There are zeros in the dataset, and I want to transform the data so that I get normality. I then want to build a predictive model and predict "Output" for the following condition: Carbon=2, Cooling=10, Drying=20

However, I keep getting NaNs in my output. Have I done the transformation incorrectly, or is the model flawed?


Solution

  • I think the problem is the way you use $ in the formula. When you write a$some_variable, the model terms are named a$some_variable, whereas the test record you pass to predict() names its columns some_variable, so the two do not match. Pass the data frame through the data argument instead, as in the approach below. Note that this example predicts at Drying=10; set Drying=20 for the condition in your question. Please let me know if it fixes your issue.

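    # Estimate the Box-Cox lambda, passing the data frame via `data`
    bc <- boxcox(Output ~ Cooling * Drying + Carbon, data = a)
    lambda <- bc$x[which.max(bc$y)]
    a$lambda <- lambda
    # Fit on the transformed response; the terms are now named after the plain columns
    new.model <- lm(((Output^lambda - 1) / lambda) ~ Drying * Cooling + Carbon, data = a)
    
    # Predict for a new record (the result is on the transformed scale)
    predict(new.model, data.frame(Carbon = 2, Cooling = 10, Drying = 10, lambda = lambda))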
    

    Output:

               1 
    0.1812739866 
    

    A look at what happens when you use the $ approach with lm:

                           Estimate   Std. Error  t value  Pr(>|t|)   
    (Intercept)        -3.141173410  1.342601277 -2.33962 0.0474440 * 
    a$Drying            0.060882585  0.039681152  1.53429 0.1635024   
    a$Cooling           0.275926915  0.102135431  2.70158 0.0270079 * 
    a$Carbon            0.219900733  0.059038120  3.72472 0.0058317 **
    a$Drying:a$Cooling -0.004854491  0.001593430 -3.04657 0.0159038 * 
    

    However, without $, this would look like:

    Coefficients:
                       Estimate   Std. Error  t value  Pr(>|t|)   
    (Intercept)    -3.141173410  1.342601277 -2.33962 0.0474440 * 
    Drying          0.060882585  0.039681152  1.53429 0.1635024   
    Cooling         0.275926915  0.102135431  2.70158 0.0270079 * 
    Carbon          0.219900733  0.059038120  3.72472 0.0058317 **
    Drying:Cooling -0.004854491  0.001593430 -3.04657 0.0159038 *
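
To make the difference concrete, here is a minimal sketch that fits the model both ways on the data frame from the question and predicts at Carbon=2, Cooling=10, Drying=20. Several things in it are my own assumptions rather than part of the original answer: the helper names `bc_transform` and `inv_boxcox`, the extra column `Output_bc`, and the use of `plotit = FALSE`, which means lambda is taken from the grid that `boxcox()` evaluates and may differ slightly from the value used above. It also back-transforms the prediction, since `predict()` returns a value on the Box-Cox scale rather than in the units of `Output`.

```r
library(MASS)  # provides boxcox()

a <- data.frame(
  Output  = c(0.065, 8.00, 2.320, 0.128, 42.500, 35.200, 18.200, 2.94, 1.68, 13.90, 43.50, 3.810, 2.600),
  Carbon  = c(20.0, 22.5, 10.0, 7.0, 35.0, 20.0, 35.0, 2.0, 10.0, 25.0, 30.0, 10.0, 8.0),
  Cooling = c(0.0, 50.0, 12.0, 0.0, 12.70, 12.70, 5.0, 2.0, 0.00, 0.00, 12.70, 10.00, 14.69),
  Drying  = c(0.0, 70.00, 0.00, 0.00, 0.90, 0.90, 0.90, 55.80, 0.00, 0.00, 0.90, 15.00, 35.56)
)

# Hypothetical helpers: the Box-Cox transform and its inverse.
# Both fall back to log/exp when lambda is numerically zero.
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
inv_boxcox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

# Pick the lambda with the highest profile log-likelihood (no plot drawn).
bc <- boxcox(Output ~ Cooling * Drying + Carbon, data = a, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

a$Output_bc <- bc_transform(a$Output, lambda)
new_point <- data.frame(Carbon = 2, Cooling = 10, Drying = 20)

# Plain column names plus `data = a`: predict() can match newdata to the terms.
fit_data <- lm(Output_bc ~ Drying * Cooling + Carbon, data = a)
pred_bc <- predict(fit_data, newdata = new_point)  # transformed scale
pred_output <- inv_boxcox(pred_bc, lambda)         # back in the units of Output

# a$... inside the formula: predict() looks up a$Drying etc. in `a` itself,
# so newdata is effectively ignored (R warns about the row mismatch).
fit_dollar <- lm(a$Output_bc ~ a$Drying * a$Cooling + a$Carbon)
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_point))

length(pred_bc)      # one prediction for the one new row
length(pred_dollar)  # one value per training row instead
```

With the `$` formula, `predict()` hands back the fitted values for the training rows rather than a prediction for the new record. Separately, when back-transforming, `lambda * z + 1` must be positive for the inverse to be defined; if it is negative, R returns NaN, which is one way NaNs can show up in this kind of workflow.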