Search code examples
rstatisticspca

Perform pca on replicate treatments instead of parameters


I have a dataset in form that column 1 contain treatments name and remaining columns contain values for those treatments and there are three replicates for each treatment. For illustration I have created simulated dataset using iris dataset as shown below:

df <- read.table(text = '"Treatment" "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
"treatment_a" 5.1 3.5 1.4 0.2
"treatment_a" 4.9 3 1.4 0.2
"treatment_a" 4.7 3.2 1.3 0.2
"treatment_b" 4.6 3.1 1.5 0.2
"treatment_b" 5 3.6 1.4 0.2
"treatment_b" 5.4 3.9 1.7 0.4
"treatment_c" 4.6 3.4 1.4 0.3
"treatment_c" 5 3.4 1.5 0.2
"treatment_c" 4.4 2.9 1.4 0.2
"treatment_d" 4.9 3.1 1.5 0.1
"treatment_d" 5.4 3.7 1.5 0.2
"treatment_d" 4.8 3.4 1.6 0.2
"treatment_e" 4.8 3 1.4 0.1
"treatment_e" 4.3 3 1.1 0.1
"treatment_e" 5.8 4 1.2 0.2
"treatment_f" 5.7 4.4 1.5 0.4
"treatment_f" 5.4 3.9 1.3 0.4
"treatment_f" 5.1 3.5 1.4 0.3
"treatment_g" 5.7 3.8 1.7 0.3
"treatment_g" 5.1 3.8 1.5 0.3
"treatment_g" 5.4 3.4 1.7 0.2
"treatment_h" 5.1 3.7 1.5 0.4
"treatment_h" 4.6 3.6 1 0.2
"treatment_h" 5.1 3.3 1.7 0.5', header = TRUE)

I want to perform a pca on this dataset using R in a way that treatments with replicates are plotted on the plot instead of variables, treatment names should also be labelled on the plot. I have looked for similar questions on stackoverflow but did not find one similar to my problem.


Solution

  • Original Response

    Are you looking to make a scatter plot with the first and second principle components plotted on the x and y axes, respectively? And then you want the points to be labeled with the treatments? If so, you can give this a shot. I am using the ggplot2 package.

    I also added a color aesthetic to the pot. Feel free to drop that part if you don't want it.

    df <- read.table(text = '"Treatment" "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
    "treatment_a" 5.1 3.5 1.4 0.2
    "treatment_a" 4.9 3 1.4 0.2
    "treatment_a" 4.7 3.2 1.3 0.2
    "treatment_b" 4.6 3.1 1.5 0.2
    "treatment_b" 5 3.6 1.4 0.2
    "treatment_b" 5.4 3.9 1.7 0.4
    "treatment_c" 4.6 3.4 1.4 0.3
    "treatment_c" 5 3.4 1.5 0.2
    "treatment_c" 4.4 2.9 1.4 0.2
    "treatment_d" 4.9 3.1 1.5 0.1
    "treatment_d" 5.4 3.7 1.5 0.2
    "treatment_d" 4.8 3.4 1.6 0.2
    "treatment_e" 4.8 3 1.4 0.1
    "treatment_e" 4.3 3 1.1 0.1
    "treatment_e" 5.8 4 1.2 0.2
    "treatment_f" 5.7 4.4 1.5 0.4
    "treatment_f" 5.4 3.9 1.3 0.4
    "treatment_f" 5.1 3.5 1.4 0.3
    "treatment_g" 5.7 3.8 1.7 0.3
    "treatment_g" 5.1 3.8 1.5 0.3
    "treatment_g" 5.4 3.4 1.7 0.2
    "treatment_h" 5.1 3.7 1.5 0.4
    "treatment_h" 4.6 3.6 1 0.2
    "treatment_h" 5.1 3.3 1.7 0.5', header = TRUE)
    
    # run principle components, ignore first column
    pr <- prcomp(df[, 2:5])
    
    # run predict to get the first and second principle components
    pr_pred <- predict(pr)
    
    # put this into a data frame so we can use ggplot
    df2 <- data.frame(Treatment = df$Treatment,
                      pr_pred[, 1:2])
    
    library(ggplot2)
    
    ggplot(data = df2, aes(x = PC1, y = PC2, 
                           colour = Treatment, 
                           label = Treatment)) + 
        geom_text()
    

    enter image description here

    Added ellipses

    To add these, we'll have to change how many categories there are. We'll go with three. Hopefully, in your actual data set, there are enough to draw the ellipses you are looking for.

    df_mod <- read.table(text = '"Treatment" "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
    "treatment_a" 5.1 3.5 1.4 0.2
                     "treatment_a" 4.9 3 1.4 0.2
                     "treatment_a" 4.7 3.2 1.3 0.2
                     "treatment_b" 4.6 3.1 1.5 0.2
                     "treatment_b" 5 3.6 1.4 0.2
                     "treatment_b" 5.4 3.9 1.7 0.4
                     "treatment_c" 4.6 3.4 1.4 0.3
                     "treatment_c" 5 3.4 1.5 0.2
                     "treatment_c" 4.4 2.9 1.4 0.2
                     "treatment_a" 4.9 3.1 1.5 0.1
                     "treatment_a" 5.4 3.7 1.5 0.2
                     "treatment_a" 4.8 3.4 1.6 0.2
                     "treatment_b" 4.8 3 1.4 0.1
                     "treatment_b" 4.3 3 1.1 0.1
                     "treatment_b" 5.8 4 1.2 0.2
                     "treatment_c" 5.7 4.4 1.5 0.4
                     "treatment_c" 5.4 3.9 1.3 0.4
                     "treatment_c" 5.1 3.5 1.4 0.3
                     "treatment_a" 5.7 3.8 1.7 0.3
                     "treatment_a" 5.1 3.8 1.5 0.3
                     "treatment_b" 5.4 3.4 1.7 0.2
                     "treatment_b" 5.1 3.7 1.5 0.4
                     "treatment_c" 4.6 3.6 1 0.2
                     "treatment_c" 5.1 3.3 1.7 0.5', header = TRUE)
    
    
    pr_mod <- prcomp(df_mod[, 2:5])
    pr_pred_mod <- predict(pr_mod)
    
    df2_mod <- data.frame(Treatment = df_mod$Treatment,
                      pr_pred_mod[, 1:2])
    
    ggplot(data = df2_mod, aes(x = PC1, y = PC2, 
                           colour = Treatment, 
                           label = Treatment)) + 
        geom_text() + 
        stat_ellipse(show.legend = FALSE)
    

    enter image description here