
Kernel PCA with kernlab and classification of the colon-cancer dataset


I need to perform kernel PCA on the colon-cancer dataset, and then plot the number of principal components against the classification accuracy obtained on the PCA-transformed data.

For the first part I am using kernlab in R as follows (the number of features is 2 here; I will then vary it from, say, 2 to 100):

kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)

I am having a tough time understanding how to use this PCA output for classification (I can use any classifier, e.g. an SVM).

EDIT: My question is how to feed the output of the PCA into a classifier.

The cleaned data looks like this: colon cancer cleaned data

The uncleaned original data looks like this: colon cancer Uncleaned data


Solution

  • Here is a small example of how to use the kpca function of the kernlab package.

    I checked the colon-cancer file, but it needs a bit of cleaning before it can be used, so I will use a random data set instead.

    Assume the following data set:

    y <- rep(c(-1,1), c(50,50))
    x1 <- runif(100)
    x2 <- runif(100)
    x3 <- runif(100)
    x4 <- runif(100)
    x5 <- runif(100)
    df <- data.frame(y,x1,x2,x3,x4,x5)
    
    > df
         y          x1          x2          x3         x4          x5
    1   -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
    2   -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
    3   -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
    4   -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
    5   -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
    6   -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
    7   -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
    8   -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
    9   -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228
    

    To run the kernel PCA on this data set (the first column, y, is dropped because it is the class label):

    library(kernlab)
    kpc <- kpca(~.,data=df[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=4)
    

    This is essentially the call you already have. Note, however, that the features argument is the number of principal components to return, not the number of classes in your y variable. You may know this already, but with 2000 variables, keeping only 2 principal components might not be what you want. Choose this number carefully by checking the eigenvalues. In your case I would probably compute 100 principal components and then keep the first n of them, i.e. those with the highest eigenvalues. Let's see this in my random example after running the previous code.

    To see the eigenvalues:

    > kpc@eig 
        Comp.1     Comp.2     Comp.3     Comp.4 
    0.03756975 0.02706410 0.02609828 0.02284068 
    

    In my case all of the components have extremely low eigenvalues because my data is random. In your case I assume you will get better ones. Keep the n components with the highest eigenvalues; an eigenvalue of zero means the component explains none of the variance. (For the sake of the demonstration I will use all of them in the svm below.)
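
    As a rough illustration of the selection step described above, the sketch below compares the retained eigenvalues with each other. The seed, the random data and the variable names (ev, share, cum_share) are assumptions made for this example only, and the shares are relative to the retained components, not to the total variance in feature space.

```r
library(kernlab)

set.seed(1)  # illustrative seed, not part of the original example
df <- data.frame(y = rep(c(-1, 1), c(50, 50)),
                 matrix(runif(500), ncol = 5))

# Same call as above, on the freshly generated random data
kpc <- kpca(~ ., data = df[, -1], kernel = "rbfdot",
            kpar = list(sigma = 0.2), features = 4)

ev <- eig(kpc)              # eigenvalues of the retained components
share <- ev / sum(ev)       # share of each component among the retained ones
cum_share <- cumsum(share)  # cumulative share, to help pick a cut-off n

print(round(ev, 4))
print(round(cum_share, 3))
```

    On real data you would typically request more components than you expect to need and look for the point where the eigenvalues level off.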

    To access the principal components, i.e. the PCA output:

    > kpc@pcv
                    [,1]        [,2]         [,3]        [,4]
      [1,] -0.1220123051  1.01290883 -0.935265092  0.37279158
      [2,]  0.0420830469  0.77483019 -0.009222970  1.14304032
      [3,] -0.7060568260  0.31153129 -0.555538694 -0.71496666
      [4,]  0.3583160509 -0.82113573  0.237544936 -0.15526000
      [5,]  0.1158956953 -0.92673486  1.352983423 -0.27695507
      [6,]  0.2109994978 -1.21905573 -0.453469345 -0.94749503
      [7,]  0.0833758766  0.63951377 -1.348618472 -0.26070127
      [8,]  0.8197838629  0.34794455  0.215414610  0.32763442
      [9,] -0.5611750477 -0.03961808 -1.490553198  0.14986663
      ...
      ...
    

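    This returns a matrix with 4 columns, one per component requested through the features argument; this is the PCA output. kernlab's kpca returns an S4 object, which is why its slots are accessed with @, as in kpc@pcv. kernlab also provides rotated(kpc), which returns the training data projected onto the principal components, and predict(kpc, newdata), which projects new observations with the same fit; these are what you need when scoring a held-out test set.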
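
    A minimal sketch of those two accessors, assuming an arbitrary 80/20 split of random data (the split, the seed and the names train_scores and test_scores are illustrative, not part of the original example):

```r
library(kernlab)

set.seed(1)  # illustrative seed
x <- data.frame(matrix(runif(500), ncol = 5))
train <- x[1:80, ]    # rows used to fit the kernel PCA
test  <- x[81:100, ]  # rows held out

kpc <- kpca(~ ., data = train, kernel = "rbfdot",
            kpar = list(sigma = 0.2), features = 4)

train_scores <- rotated(kpc)        # training rows projected onto the components
test_scores  <- predict(kpc, test)  # held-out rows projected with the same fit

print(dim(train_scores))
print(dim(test_scores))
```

    Both results have one row per observation and one column per component, so they can be passed to a classifier in the same way as the matrix above.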

    You then feed this matrix to an svm as follows:

    svmmatrix <- kpc@pcv
    library(e1071)
    svm(svmmatrix, as.factor(y))
    
    Call:
    svm.default(x = svmmatrix, y = as.factor(y))
    
    Parameters:
       SVM-Type:  C-classification 
     SVM-Kernel:  radial 
           cost:  1 
          gamma:  0.25 
    
    Number of Support Vectors:  95
    

    And that's it! A very good explanation of PCA that I found on the internet can be found here, in case you or anyone else reading this wants to find out more.
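
    For the second part of the question (number of principal components vs. classification accuracy), one possible approach is sketched below. It is only an illustration on random data: the 70/30 split, the seed, the range n_comp and the use of rotated() and predict() to obtain training and test scores are assumptions of this sketch, and on random data the accuracy will hover around chance level.

```r
library(kernlab)
library(e1071)

set.seed(1)  # illustrative seed
n <- 100
df <- data.frame(y = factor(rep(c(-1, 1), c(50, 50))),
                 matrix(runif(n * 5), ncol = 5))

# Hypothetical 70/30 train/test split
train_idx <- sample(n, 70)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

n_comp <- 2:5  # numbers of components to try (extend this range for real data)
accuracy <- sapply(n_comp, function(k) {
  # Fit kernel PCA on the training predictors only
  kpc <- kpca(~ ., data = train[, -1], kernel = "rbfdot",
              kpar = list(sigma = 0.2), features = k)
  # Train the SVM on the projected training data
  fit <- svm(rotated(kpc), train$y)
  # Project the test predictors with the same fit, then classify them
  pred <- predict(fit, predict(kpc, test[, -1]))
  mean(pred == test$y)
})

plot(n_comp, accuracy, type = "b",
     xlab = "Number of principal components",
     ylab = "Test accuracy")
```

    Fitting the kernel PCA on the training rows only keeps the test rows out of the feature extraction step, which gives a more honest accuracy estimate than projecting the whole data set first.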