
Mismatch between descriptive analysis and coefficient estimates of a linear model in R


I'm a noob student dealing with modelling in R.

I'm trying to find the best model for my dataset, which has n rows (replicates) and m columns (variables). I would like to build a linear model (lm) to explain the effects of 4 categorical regressors on Y (continuous data), the number of plant shoots per sq. m.

The model formula is: lm(Y ~ a + b + c + d).

Regressor levels: "a" has 4 levels (shading percentage classes), "b" has 4 levels (4 survey years), "c" has 3 levels (elevation classes) and "d" has 7 levels (the 7 spatial polygons in which shoots were sampled).

In the descriptive analysis I observed (with boxplots) a strong decrease in Y across the levels of every regressor, in particular for the categorical variable "a": its levels "I" (100% light), "II" (60%), "III" (30%) and "IV" (10% light) have median Y values of 350, 250, 150 and 100.

In the model summary I can see the expected influence on Y for each regressor level, except for "a": the levels of this regressor show the opposite relationship with Y, with significant p-values. Compared to level I (the reference level, absorbed into the intercept), the estimated coefficient for level II is +69, for III +133 and for IV +150.

The diagnostic plots look fine: the residuals are normally distributed and the variance is homogeneous.

So my question is: is this kind of effect possible, or should I read the summary in a different way?

Thanks in advance for your help.

Here you can see the distribution of Y for each factor level included in the model

Summary and diagnostic plot


Solution

  • I flagged your question for migration to Cross Validated, as it is really more of a statistics question. I hope you get a more detailed answer there.

    In any case, one potential cause of your mismatch is that one of your explanatory variables is correlated with another. That wouldn't show up in your diagnostic plots. The correlated variable "causes" the decreasing density that you see in the descriptive plots, because each boxplot looks at one variable at a time. Once you remove that effect by including the variable in your regression, the real effect of shading shows up as increasing density.

    A quick check is to run a few tests of association between your explanatory variables. Alternatively, you can fit the linear model step by step, adding one variable after the other, to see whether the signs of the shading coefficients change after you add a particular explanatory variable.
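
The two checks suggested above can be sketched in R as follows. This is only an illustration on simulated data, not the asker's dataset: the data frame `dat`, the sample size, the effect sizes and the use of only two regressors (`a` for shading, `d` for polygon) are all assumptions made up for the example. The simulation is built so that darker polygons also have far fewer shoots, while the true effect of shading is positive.

```r
set.seed(1)  # hypothetical simulated data, NOT the asker's dataset

n <- 400
# Polygon "d": higher-numbered polygons are darker AND have fewer shoots
d <- factor(sample(1:4, n, replace = TRUE))
# Shading class "a" is strongly tied to the polygon (equal to it, or one class off)
shift <- sample(c(-1, 0, 0, 0, 1), n, replace = TRUE)
a_num <- pmin(4, pmax(1, as.integer(d) + shift))
a <- factor(a_num, levels = 1:4, labels = c("I", "II", "III", "IV"))
# True shading effect is positive (+20 per class); polygon effect is strongly negative
Y <- 400 - 100 * (as.integer(d) - 1) + 20 * (a_num - 1) + rnorm(n, sd = 15)
dat <- data.frame(Y, a, d)

# Descriptive view: medians of Y per shading class, ignoring the polygon
tapply(dat$Y, dat$a, median)

# Check 1: test of association between the two categorical regressors
tab <- table(dat$a, dat$d)
tab
chisq.test(tab)

# Check 2: add regressors one at a time and watch the signs of the "a" coefficients
m1 <- lm(Y ~ a, data = dat)        # shading only: coefficients follow the boxplots
m2 <- update(m1, . ~ . + d)        # shading adjusted for polygon
round(cbind(a_only = coef(m1)[2:4], a_plus_d = coef(m2)[2:4]), 1)
```

In this simulated setting the medians decrease from "I" to "IV" and the coefficients of `a` are negative in `m1`, but they turn positive once `d` is added in `m2`. With the real data, the same pattern after adding one particular regressor would point to that regressor as the source of the mismatch.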