Search code examples
rpca

How to calculate principal component analysis to plot graph showing PC1 vs PC2 using my data set?


Please could somebody suggest how I would go about making a principle component analysis with the gene data set I have.

I have a table containing 15 columns. The first one is disease group, where 0 is control, 1 is Ulcerative Colitis and 2 stands for Crohn’s.

The remaining 14 columns correspond to 14 different genes.

I would like to plot a PC1 vs PC2 following PCA (via prcomp), to show whether any clustering or separation between the three groups occurs based on gene expression data ( with each axis showing the proportion of variance). However, I am struggling to know where to start, as I cannot convert my column 1 to row names via row.names=1 as R doesn’t allow repeating row names.

Converting to a matrix and trying to use the below code, does not work.

mockdata1 <- as.matrix(mockdata)
rownames(mockdata1) <- mockdata1[,1]
mockdata1[,1] <- NULL
or 
mockdata2 <-mockdata1 [ ,-1]

With the previous examples that I have done, I have been able to compute the PCA and plot the PCA1 vs PCA2 and colour the data accordingly, following row.names=1, but not sure how to overcome this first initial problem, as I can't use this here.

I have included my data below via dput(head(mockdata))

structure(list(Disease = c(1L, 1L, 0L, 0L, 2L, 2L), Gene1 = c(9104.774619, 
35924.12358, 6.780294688, 1284.690716, 69.50341155, 3935.107345
), Gene2 = c(5224.114486, 35625.73119, 18.35291351, 511.9272679, 
186.7270146, 47611.65544), Gene3 = c(1472.348466, 137571.5525, 
20.78531289, 3019.140256, 146.9615338, 108935.1303), Gene4 = c(2487.124686, 
147604.774, 3.574347972, 1371.576262, 210.6773417, 82831.97458
), Gene5 = c(1872.328747, 235675.6461, 9.834667594, 583.1631957, 
120.6931223, 75874.49936), Gene6 = c(1675.724728, 35931.1852, 
9.91026361, 1634.038443, 58.04818134, 23502.78972), Gene7 = c(3775.885073, 
169672.9921, 5.41305941, 929.2125312, 97.72621248, 46023.7009
), Gene8 = c(5015.202216, 137455.0032, 2.995124554, 1113.882634, 
83.17636201, 14048.19237), Gene9 = c(883.5716868, 45920.44167, 
6.399646876, 892.313155, 117.1104906, 10825.47974), Gene10 = c(1607.790858, 
146627.0588, 1.967559425, 1237.299298, 90.8941744, 32747.04713
), Gene11 = c(2345.478241, 91047.57303, 12.33867961, 663.576224, 
384.5839119, 6692.728154), Gene12 = c(2772.362496, 15511.96753, 
15.64843017, 4143.085461, 169.545757, 22484.03574), Gene13 = c(4131.51741, 
48601.7059, 21.66175797, 2250.0628, 316.0677196, 16612.6508), 
Gene14 = c(1252.440598, 54794.36695, 2.925615978, 708.0342528, 
211.822519, 14021.28425)), row.names = c(NA, 6L), class = "data.frame")

Solution

  • I read about the statistics behind PCA a long time ago. It basically generates df*t(df). If you transpose this you could get 14 PCs across the genes. Or you could do this across the 6 samples. Not sure which one you are interested in, comparing genes, or diseases?

    For now, just change the Disease column.

    df1 <- df[,-1]
    
    prcomp(df1, scale=TRUE)
    
    Standard deviations (1, .., p=6):
    [1] 3.535100e+00 1.197773e+00 2.341240e-01 1.164908e-01 4.575297e-03 1.341079e-16
    
    Rotation (n x k) = (14 x 6):
                 PC1         PC2         PC3          PC4         PC5         PC6
    Gene1  0.2599087 -0.28372577  0.85593436  0.105415154  0.22237406 -0.03275210
    Gene2  0.2217516  0.51374227  0.19750686  0.587702793 -0.28955565 -0.05771744
    Gene3  0.2658620  0.28239514 -0.18315564  0.178995551  0.20110401  0.55717953
    Gene4  0.2793220  0.12737931 -0.15464656  0.169556323  0.07195134 -0.18818707
    Gene5  0.2818335 -0.06260368 -0.17524380  0.066508759  0.19418520 -0.23609300
    Gene6  0.2753095  0.19146854 -0.05941163 -0.009035774  0.27267048 -0.02872877
    Gene7  0.2804052 -0.10775145 -0.11519116  0.035882068  0.20239393 -0.18183889
    Gene8  0.2697234 -0.25139746 -0.04378742 -0.067463492  0.21022545  0.41231175
    Gene9  0.2787618 -0.13908918 -0.13407274 -0.103917770 -0.05348686  0.14863994
    Gene10 0.2782801 -0.14660837 -0.15870274 -0.039789645  0.20346369 -0.24515625
    Gene11 0.2672274 -0.27313184 -0.09102046 -0.087085239 -0.41532869  0.18704109
    Gene12 0.2089616  0.55676419  0.20707121 -0.730763260  0.02663728 -0.08109933
    Gene13 0.2819621 -0.06202963  0.11256511 -0.133063115 -0.54206507  0.22546812
    Gene14 0.2797161 -0.12219139 -0.12034622 -0.025956127 -0.33551757 -0.46422864
    

    for PCs across diseses transpose your matrix:

    df2 <- t(df)
    colnames(df2) <- df$Disease
    df3 <- df2[-1,]
    prcomp(df3,scale=TRUE)
    
    Standard deviations (1, .., p=6):
    [1] 1.3881805 1.3086775 1.0332001 0.8999575 0.5686278 0.3994426
    
    Rotation (n x k) = (6 x 6):
             PC1        PC2         PC3        PC4        PC5         PC6
    1 -0.1761229 -0.4084611  0.42521515 -0.7364243  0.2424742 -0.14218944
    1  0.5843042  0.3073572 -0.04722848 -0.2941286  0.4419985  0.52916471
    0 -0.5145338  0.4178422  0.04746496 -0.3238892 -0.4356411  0.51353922
    0 -0.4042091  0.3652627  0.47093069  0.3634759  0.5908253 -0.01526729
    2 -0.3531166  0.1750554 -0.74358399 -0.2618183  0.3973516 -0.25555829
    2  0.2734007  0.6324854  0.20003938 -0.2561246 -0.2215796 -0.60868811