Search code examples
rstatisticsregressionstat

Issue with a multiple regression model in R


First let me apologize but I'm a biologist starting in the world of bioinformatics and therefore in R programming and statistics. I have to do an analysis of a multilinear regression model with the data (Penta) from Library(mvdalav).

I have to try different models including the PLS model that is the model that is normally used for this data set (https://rdrr.io/cran/mvdalab/f/README.md)

However, they ask us to play with the data more models and I'm very lost as the data seems to always give me errors:

1) Normal multiple regression model:


> mod2<-mod1<-lm(Penta1$log.RAI~.,Penta1)
> summary(mod2)

Call:
lm(formula = Penta1$log.RAI ~ ., data = Penta1)

Residuals:
ALL 30 residuals are 0: no residual degrees of freedom!

Coefficients: (15 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.000e-01         NA      NA       NA
Obs.NameAAWAA  8.500e-01         NA      NA       NA
Obs.NameAAYAA  5.600e-01         NA      NA       NA
Obs.NameEKWAP  1.400e+00         NA      NA       NA
Obs.NameFEAAK  4.000e-01         NA      NA       NA
Obs.NameFSPFR  7.400e-01         NA      NA       NA
Obs.NameGEAAK -4.200e-01         NA      NA       NA
Obs.NameLEAAK  5.000e-01         NA      NA       NA
Obs.NamePGFSP  1.000e+00         NA      NA       NA
Obs.NameRKWAP  2.080e+00         NA      NA       NA
Obs.NameRYLPT  5.000e-01         NA      NA       NA
Obs.NameVAAAK  1.114e-15         NA      NA       NA
Obs.NameVAAWK  3.300e-01         NA      NA       NA
Obs.NameVAWAA  1.530e+00         NA      NA       NA
Obs.NameVAWAK  1.550e+00         NA      NA       NA
Obs.NameVEAAK  6.100e-01         NA      NA       NA
Obs.NameVEAAP  2.800e-01         NA      NA       NA
Obs.NameVEASK  3.000e-01         NA      NA       NA
Obs.NameVEFAK  1.670e+00         NA      NA       NA
Obs.NameVEGGK -9.000e-01         NA      NA       NA
Obs.NameVEHAK  1.630e+00         NA      NA       NA
Obs.NameVELAK  6.900e-01         NA      NA       NA
Obs.NameVESAK  3.800e-01         NA      NA       NA
Obs.NameVESSK  1.000e-01         NA      NA       NA
Obs.NameVEWAK  2.830e+00         NA      NA       NA
Obs.NameVEWVK  1.810e+00         NA      NA       NA
Obs.NameVKAAK  2.100e-01         NA      NA       NA
Obs.NameVKWAA  1.810e+00         NA      NA       NA
Obs.NameVKWAP  2.450e+00         NA      NA       NA
Obs.NameVWAAK  1.400e-01         NA      NA       NA
S1                    NA         NA      NA       NA
L1                    NA         NA      NA       NA
P1                    NA         NA      NA       NA
S2                    NA         NA      NA       NA
L2                    NA         NA      NA       NA
P2                    NA         NA      NA       NA
S3                    NA         NA      NA       NA
L3                    NA         NA      NA       NA
P3                    NA         NA      NA       NA
S4                    NA         NA      NA       NA
L4                    NA         NA      NA       NA
P4                    NA         NA      NA       NA
S5                    NA         NA      NA       NA
L5                    NA         NA      NA       NA
P5                    NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 29 and 0 DF,  p-value: NA

2) Study the reduced model provided by the stepwise method. The aim is to compare the RMSE of the reduced model and the complete model for the training group and for the test group.

step(lm(log.RAI~.,data = penta),direction = "backward") Error in step(lm(log.RAI ~ ., data = penta), direction = "backward") : AIC is -infinity for this model, so 'step' cannot proceed

3)Find the best model by the criteria of the AIC and by the adjusted R2

4) PLS model --> what fits the data following:https://rdrr.io/cran/mvdalab/f/README.md

5)Also study it with the Ridge Regression method with the lm.ridge () function or similar 6) Finally we will study the LASSO method with the lars () function of Lasso project.

I'm super lost with why the data.frame gave those errors and also how to develop the analysis. Any help with any of the parts would be much appreciated

Kind regards


Solution

  • Ok after reading the vignette, Penta is some data obtained from drug discovery and the first column is the unique identifier. To do regression or downstream analysis you need to exclude this column. For the steps below, I simply do Penta[,-1] as input data

    For the first part, this works:

    library(mvdalab)
    data(Penta)
    summary(lm(log.RAI~.,data = Penta[,-1]))
    
    Call:
    lm(formula = log.RAI ~ ., data = Penta[, -1])
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.39269 -0.12958 -0.05101  0.07261  0.63414 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
    (Intercept) -0.80263    0.92495  -0.868  0.40016   
    S1          -0.09783    0.03895  -2.512  0.02489 * 
    L1           0.03236    0.04973   0.651  0.52576   
    P1          -0.10795    0.08521  -1.267  0.22587   
    S2           0.08670    0.04428   1.958  0.07043 . 
    

    Second part for AIC is ok as well:

     step(lm(log.RAI~.,data = Penta[,-1]),direction="backward")
    Start:  AIC=-57.16
    log.RAI ~ S1 + L1 + P1 + S2 + L2 + P2 + S3 + L3 + P3 + S4 + L4 + 
        P4 + S5 + L5 + P5
    
           Df Sum of Sq    RSS     AIC
    - P3    1   0.00150 1.5374 -59.132
    - L4    1   0.00420 1.5401 -59.080
    

    If you want to select model with AIC, the one above works. For adjusted R^2 i think most likely there are packages out there that does this

    For lm.ridge, do the same:

    library(MASS)
    fit=lm.ridge(log.RAI~.,data = Penta[,-1])
    

    For lars, lasso, you need to have the predictors etc in a matrix, so let's do

    library(lars)
    data = as.matrix(Penta[,-1])
    fit = lars(x=data[,-ncol(data)],y=data[,"log.RAI"],type="lasso")