Kernel PCA with kernlab and classification of the colon-cancer dataset

I need to perform kernel PCA on the colon-cancer dataset and then plot the number of principal components against the classification accuracy obtained on the PCA-transformed data.

For the first part I am using kernlab in R as follows (the number of features is 2 here; I will then vary it from, say, 2 to 100):

kpc <- kpca(~.,data=data[,-1],kernel="rbfdot",kpar=list(sigma=0.2),features=2)

I am having a tough time understanding how to use this PCA output for classification (I can use any classifier, e.g. an SVM).

EDIT: My question is how to feed the output of PCA into a classifier.

The data looks like this (cleaned): [colon cancer cleaned data]

The uncleaned original data looks like this: [colon cancer uncleaned data]

Solution

I will show you a small example of how to use the kpca function of the kernlab package.

I checked the colon-cancer file, but it needs a bit of cleaning before it can be used, so I will use a random data set instead.

Assume the following data set:

y <- rep(c(-1,1), c(50,50))
x1 <- runif(100)
x2 <- runif(100)
x3 <- runif(100)
x4 <- runif(100)
x5 <- runif(100)
df <- data.frame(y,x1,x2,x3,x4,x5)

> df
     y          x1          x2          x3         x4          x5
1   -1 0.125841208 0.040543611 0.317198114 0.40923767 0.635434021
2   -1 0.113818719 0.308030825 0.708251147 0.69739496 0.839856000
3   -1 0.744765204 0.221210582 0.002220568 0.62921565 0.907277935
4   -1 0.649595597 0.866739474 0.609516644 0.40818013 0.395951297
5   -1 0.967379006 0.926688915 0.847379556 0.77867315 0.250867680
6   -1 0.895060293 0.813189446 0.329970821 0.01106764 0.123018797
7   -1 0.192447416 0.043720717 0.170960540 0.03058768 0.173198036
8   -1 0.085086619 0.645383728 0.706830885 0.51856286 0.134086770
9   -1 0.561070374 0.134457795 0.181368729 0.04557505 0.938145228

To run the kernel PCA on this data frame (dropping the response column y) you need to do:

library(kernlab)
kpc <- kpca(~., data=df[,-1], kernel="rbfdot", kpar=list(sigma=0.2), features=4)

which is the same way you used it. However, I need to point out that the features argument is the number of principal components to return, not the number of classes in your y variable. Maybe you knew this already, but with 2000 variables, producing only 2 principal components might not be what you are looking for. You need to choose this number carefully by checking the eigenvalues. In your case I would probably compute 100 principal components and then keep the first n of them, i.e. those with the highest eigenvalues. Let's see this in my random example after running the previous code.

To see the eigenvalues:

> kpc@eig 
    Comp.1     Comp.2     Comp.3     Comp.4 
0.03756975 0.02706410 0.02609828 0.02284068

In my case all of the components have extremely low eigenvalues because my data is random. In your case I assume you will get better ones. You need to keep the n components with the highest eigenvalues; an eigenvalue of zero means that the component explains none of the variance. (Just for the sake of the demonstration I will use all of them in the SVM below.)
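One way to turn "pick the components with the highest eigenvalues" into a number is to look at cumulative shares. The sketch below uses only base R and the four eigenvalues printed above; the 75% cut-off and the variable names are illustrative assumptions, not a recommendation. Note that the shares are relative to the components that were extracted, not to the total variance in the kernel feature space.

```r
# Eigenvalues copied from the kpc@eig output above
eig_vals <- c(Comp.1 = 0.03756975, Comp.2 = 0.02706410,
              Comp.3 = 0.02609828, Comp.4 = 0.02284068)

# Share of each component relative to the extracted components only
share <- eig_vals / sum(eig_vals)
cum_share <- cumsum(share)

# Hypothetical cut-off: keep the fewest components whose cumulative
# share reaches 75% of the extracted total
threshold <- 0.75
n_keep <- unname(which(cum_share >= threshold)[1])

print(round(cum_share, 3))
print(n_keep)
```

With real data you would take the eigenvalues from kpc@eig directly rather than typing them in.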

To access the principal components, i.e. the PCA output, you do this:

> kpc@pcv
                [,1]        [,2]         [,3]        [,4]
  [1,] -0.1220123051  1.01290883 -0.935265092  0.37279158
  [2,]  0.0420830469  0.77483019 -0.009222970  1.14304032
  [3,] -0.7060568260  0.31153129 -0.555538694 -0.71496666
  [4,]  0.3583160509 -0.82113573  0.237544936 -0.15526000
  [5,]  0.1158956953 -0.92673486  1.352983423 -0.27695507
  [6,]  0.2109994978 -1.21905573 -0.453469345 -0.94749503
  [7,]  0.0833758766  0.63951377 -1.348618472 -0.26070127
  [8,]  0.8197838629  0.34794455  0.215414610  0.32763442
  [9,] -0.5611750477 -0.03961808 -1.490553198  0.14986663
  ...
  ...

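This returns a matrix with 4 columns (the value of the features argument), which holds the principal component vectors. kpca returns an S4 object, which is why its slots are accessed with @, as in kpc@pcv. kernlab also stores the training data projected onto these components in kpc@rotated, and predict(kpc, newdata) projects new observations, which you will need once you evaluate on a separate test set.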
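Besides the @ slots, kernlab exports accessor functions for the same pieces. Here is a minimal sketch on random data of the same shape as the example; the seed and the variable names vecs, scores and eigs are my own illustrative choices.

```r
library(kernlab)

set.seed(42)  # assumption: fixed seed, only for reproducibility

# Random data of the same shape as the example: 100 rows, 5 predictors
df <- data.frame(y = rep(c(-1, 1), c(50, 50)),
                 matrix(runif(500), ncol = 5))

kpc <- kpca(~ ., data = df[, -1], kernel = "rbfdot",
            kpar = list(sigma = 0.2), features = 4)

vecs   <- pcv(kpc)      # principal component vectors, same as kpc@pcv
scores <- rotated(kpc)  # training data projected onto the components
eigs   <- eig(kpc)      # eigenvalues, same as kpc@eig

print(dim(vecs))
print(dim(scores))
print(eigs)
```

Both matrices have one row per observation and one column per requested component.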

You then feed the above matrix into an SVM as follows:

svmmatrix <- kpc@pcv
library(e1071)
svm(svmmatrix, as.factor(y))

Call:
svm.default(x = svmmatrix, y = as.factor(y))

Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.25 

Number of Support Vectors:  95
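To get from a single fit to the plot the question asks for (number of components vs. accuracy), one possible approach is to fit the kernel PCA once on a training split, project the held-out rows with predict, and train an SVM on the first k columns for each k. This is only a sketch under assumptions: the 70/30 split, the seed, max_k and all variable names are hypothetical choices, and on random data like this the accuracy should hover around chance.

```r
library(kernlab)
library(e1071)

set.seed(1)  # assumption: fixed seed so the split is reproducible

# Same kind of random data as in the example above
df <- data.frame(y = factor(rep(c(-1, 1), c(50, 50))),
                 matrix(runif(500), ncol = 5))

# Hypothetical 70/30 train/test split
train_idx <- sample(nrow(df), 70)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Fit kernel PCA once on the training predictors, asking for the largest
# number of components to be tried (max_k is an illustrative choice)
max_k <- 4
kpc <- kpca(~ ., data = train[, -1], kernel = "rbfdot",
            kpar = list(sigma = 0.2), features = max_k)

train_scores <- rotated(kpc)              # projected training data
test_scores  <- predict(kpc, test[, -1])  # projected test data

# For each k, train on the first k components and score on the test set
ks <- 1:max_k
acc <- sapply(ks, function(k) {
  fit  <- svm(train_scores[, 1:k, drop = FALSE], train$y)
  pred <- predict(fit, test_scores[, 1:k, drop = FALSE])
  mean(pred == test$y)
})

plot(ks, acc, type = "b",
     xlab = "Number of principal components", ylab = "Test accuracy")
```

For the colon-cancer data you would replace df with the cleaned data frame and raise max_k; cross-validation would give a steadier curve than a single split.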

And that's it! A very good explanation of PCA that I found on the internet can be found here, in case you or anyone else reading this wants to find out more.