r linear-regression anova

Log transformation for lm() in R not working


I am trying to transform some data so that the assumptions of linear models (independence, linearity, homogeneity of variance, normality) are met. I want to do this so that I can perform an ANOVA or similar. Square root transforming the response variable within my linear model has worked, but an error appears when I try to log transform.

I have tried:

    logCC_emergent_biomass.lm <- lm(log(Total_CC_noAcari_Biomass) ~ Dungfauna * Water * Earthworms, data = biomass)

But this error appears:

    Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'

Normally log transforming in this way works for me, so I am not sure what is wrong here. The values of the response variable are all small decimals (e.g. 0.001480370). Could this be the cause? If so, can anyone point me towards a way to transform this data?

These are the residual plots when the data are untransformed: [image: residual plots]


Solution

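  • You probably have zeroes in the variable you want to log transform. log(0) is -Inf, and lm() stops when the response or a predictor contains a non-finite value such as -Inf. Small decimal values are not a problem in themselves. Example with a predictor: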

    df1[1, 1] <- 0
    
    lm(Y ~ log(X1) + X2 + X3, df1)
    # Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
    #   NA/NaN/Inf in 'x'
    # In addition: Warning message:
    #   In log(X1) : NaNs produced
    

    You could consider log1p which calculates log(x + 1).

    lm(Y ~ log1p(X1) + X2 + X3, df1)
    # Call:
    # lm(formula = Y ~ log1p(X1) + X2 + X3, data = df1)
    # 
    # Coefficients:
    #   (Intercept)    log1p(X1)           X2           X3
    #        0.9963      -0.8648       0.5293       1.0904 
    

    However, this changes the interpretation of the coefficients; see the related post on Cross Validated. Either way, you need to decide how to handle the zero values.

    Also see this post: How should I transform non-negative data including zeros?
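
    Before choosing a transformation, it can help to confirm that zeros really are the cause. The following is a minimal sketch, not the asker's actual data: `biomass_demo`, `response` and `treatment` are hypothetical names with made-up values, used only to show the check.

    ```r
    # Hypothetical stand-in for the real data; the values are made up for illustration.
    biomass_demo <- data.frame(
      response  = c(0.00148, 0.0032, 0, 0.0107, 0.0061, 0.0025),
      treatment = factor(c("a", "b", "a", "b", "a", "b"))
    )

    # log(0) is -Inf, which lm.fit() rejects as a non-finite response.
    log(0)

    # Count the values that cannot be log transformed.
    n_zero     <- sum(biomass_demo$response == 0, na.rm = TRUE)
    n_negative <- sum(biomass_demo$response < 0, na.rm = TRUE)

    # Indices of rows whose log-transformed response is not finite.
    bad_rows <- which(!is.finite(log(biomass_demo$response)))
    ```

    If `n_zero` is greater than zero, that would explain the error in 'y'. Running the same three checks on the real response column should show which rows are affected.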


    Data:

    df1 <- structure(list(X1 = c(0, -0.564698171396089, 0.363128411337339, 
    0.63286260496104, 0.404268323140999, -0.106124516091484, 1.51152199743894, 
    -0.0946590384130976, 2.01842371387704, -0.062714099052421), X2 = c(1.30486965422349, 
    2.28664539270111, -1.38886070111234, -0.278788766817371, -0.133321336393658, 
    0.635950398070074, -0.284252921416072, -2.65645542090478, -2.44046692857552, 
    1.32011334573019), X3 = c(-0.306638594078475, -1.78130843398, 
    -0.171917355759621, 1.2146746991726, 1.89519346126497, -0.4304691316062, 
    -0.25726938276893, -1.76316308519478, 0.460097354831271, -0.639994875960119
    ), Y = c(2.00627879909717, 1.08150911284604, 1.41465103918476, 
    1.37787039819613, 3.04863502238068, -0.828228728348569, 0.198328716326719, 
    -2.34295203837687, -1.61863179473641, 1.03962922460575)), row.names = c(NA, 
    -10L), class = "data.frame")
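
One caveat for responses as small as those in the question: when the values are close to zero, log1p(y) is almost the same as y, so it removes the error but barely changes the shape of the data. The sketch below uses made-up values to show this. It also shows one ad hoc alternative, adding half the smallest positive value before taking the log. That constant is an assumption chosen for illustration, not something taken from the question or the linked posts, and the results can be sensitive to it.

```r
# Made-up response values on the same scale as the example in the question.
y <- c(0.00148, 0.0032, 0, 0.0107, 0.0061, 0.0025)

# For values this small, log1p(y) differs from y only very slightly.
max_diff <- max(abs(log1p(y) - y))

# Ad hoc alternative (an assumption): shift by half the smallest positive value.
shift <- min(y[y > 0]) / 2
y_shifted_log <- log(y + shift)

# Every shifted value is positive, so the log is finite everywhere.
all_finite <- all(is.finite(y_shifted_log))
```

Whichever constant is used, it is worth refitting the model with a couple of different values and checking that the conclusions do not change.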