r statistics classification pca

How can I compare three variables to find out how two of them influence the third?


I have a data set with three variables (pH, dosage, and removal):

[image of the data]

Is there a tool that tells me which variable matters most for removal: pH or dosage? I was thinking of a PCA (principal component analysis), but I'm a little lost.


Solution

  • Here are some things to try.

    From the plot it seems clear that Dosage (column 2) is more closely related to Removal (column 3) than pH (column 1) is.

    Also, Dosage has a correlation of 0.61 (61%) with Removal, whereas pH has a correlation of only -0.14 (-14%).

    Neither variable is statistically significant in the lm summary output, likely because there are only nine observations.

    Stepwise regression based on AIC chooses the Removal ~ Dosage model.

    (continued after graph)

    matplot(scale(DF), type = "o")
    

    [plot: scaled pH, Dosage, and Removal plotted against row index]

    cor(DF)
    ##                 pH    Dosage    Removal
    ## pH       1.0000000 0.0000000 -0.1418573  <-- -14%
    ## Dosage   0.0000000 1.0000000  0.6091517  <-- 61%
    ## Removal -0.1418573 0.6091517  1.0000000
    
    summary(lm(Removal ~ ., data = DF))
    
    ## Call:
    ## lm(formula = Removal ~ ., data = DF)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -15.5556  -7.0556  -4.8889   0.7778  25.7778 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   69.056     39.047   1.769    0.127  
    ## pH            -2.833      6.362  -0.445    0.672  <-- not significant
    ## Dosage        12.167      6.362   1.912    0.104  <-- not significant
    ## 
    ## Residual standard error: 15.58 on 6 degrees of freedom
    ## Multiple R-squared:  0.3912,    Adjusted R-squared:  0.1883 
    ## F-statistic: 1.928 on 2 and 6 DF,  p-value: 0.2257
    
    fm <- step(lm(Removal ~ ., data = DF))
    ## ...snip...
    
    fm
    ## Call:
    ## lm(formula = Removal ~ Dosage, data = DF)
    ## 
    ## Coefficients:
    ## (Intercept)       Dosage  
    ##       52.06        12.17  
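
    To see what step() is weighing here, the candidate models can also be compared directly. This is a minimal sketch and not part of the analysis above; the list `models` and its labels are made up for illustration. AIC() and the extractAIC() value that step() reports differ only by a constant when all models are fit to the same rows, so the ranking should be the same.

    ```r
    # Same nine observations as in the Note at the end of the answer.
    DF <- data.frame(
      pH      = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
      Dosage  = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
      Removal = c(50, 60, 70, 50, 90, 95, 50, 55, 58)
    )

    # The four linear models that can be built from the two predictors
    # (illustrative names, chosen for this sketch).
    models <- list(
      null   = lm(Removal ~ 1, data = DF),
      pH     = lm(Removal ~ pH, data = DF),
      Dosage = lm(Removal ~ Dosage, data = DF),
      both   = lm(Removal ~ pH + Dosage, data = DF)
    )

    # One AIC value per model; lower is better.
    aic <- sapply(models, AIC)
    sort(aic)

    # Label of the model with the smallest AIC.
    names(which.min(aic))  # "Dosage"
    ```

    The Dosage-only model comes out lowest, which matches the model that step() settles on; the pH-only model ranks even below the intercept-only model.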
    

    Note: The input data in reproducible form is:

    DF <- data.frame(
      pH      = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
      Dosage  = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
      Removal = c(50, 60, 70, 50, 90, 95, 50, 55, 58)
    )
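
    Because pH and Dosage are exactly uncorrelated in this data (cor(DF) gives 0 for that pair), the R-squared of the two-variable model splits into the squared correlation of each predictor with Removal. The following is a small sketch of that arithmetic, with `r`, `share`, and `fit` as illustrative names; it only holds for a balanced design like this one.

    ```r
    # Same nine observations as above.
    DF <- data.frame(
      pH      = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
      Dosage  = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
      Removal = c(50, 60, 70, 50, 90, 95, 50, 55, 58)
    )

    # Correlation of each predictor with Removal.
    r <- cor(DF)[c("pH", "Dosage"), "Removal"]

    # Squared correlations: the share of the variance in Removal
    # that each predictor accounts for on its own.
    share <- r^2
    share

    # With uncorrelated predictors the shares add up to the
    # R-squared of the model that uses both.
    fit <- lm(Removal ~ pH + Dosage, data = DF)
    all.equal(sum(share), summary(fit)$r.squared)  # TRUE
    ```

    The shares are roughly 0.02 for pH and 0.37 for Dosage, which together give the Multiple R-squared of 0.3912 in the lm summary. With correlated predictors the shares would overlap and this simple split would no longer apply.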