weka covariance correlation pca

Doing PCA in Weka


I am trying to do PCA for dimension reduction in WEKA (Classification Problem).

I have 200 attributes in my data and close to 2100 rows.

Here are the steps that I follow:

  • Import the CSV file in the WEKA Explorer.

  • In the Preprocess tab, apply Normalize to the data (to bring all of the data into the range [0,1]).

  • Then apply PCA.

    • In the options for PCA there is a centerData option. If it is set to False, PCA is computed from the correlation matrix after standardizing the data (correct me if I am wrong); if it is set to True, PCA is computed from the covariance matrix.

My doubts are:

  1. Should I normalize the data before applying PCA or not? I tried PCA both without and with normalizing first, and I am getting different results, so I am confused.
  2. Should I standardize the data (bring the mean to 0) and then apply PCA?

Which setting should I select for the centerData option of PCA in WEKA in either case?


Solution

  • This question has been answered in part here: PCA first or normalization first?

    To answer your questions directly:

    Whether to normalize is a personal choice. If you set centerData=TRUE and do not normalize or standardize your data first, attributes with large values (and therefore large variance) will have a greater influence on the PCA. If you set centerData=FALSE, Weka standardizes the data for you.
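To make the scale effect concrete, here is a minimal sketch in plain Python, not Weka code. The two attributes (`income`, `age`), their values, and the helper names are made up for illustration. It computes the first principal component of a two-attribute data set once from the covariance matrix and once from the correlation matrix, using the closed-form eigenvector of a symmetric 2x2 matrix.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance of two equally long columns."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def first_component(a, b, d):
    """Unit leading eigenvector of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    # Largest eigenvalue from the closed form for a symmetric 2x2 matrix.
    lam = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    # (b, lam - a) solves (A - lam*I) v = 0 whenever b != 0.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Made-up data: two attributes on very different scales.
income = [30000.0, 45000.0, 52000.0, 61000.0, 78000.0]
age = [25.0, 32.0, 41.0, 38.0, 55.0]

var_income = cov(income, income)
var_age = cov(age, age)
cov_ia = cov(income, age)

# Covariance-based PCA: the data is only centered, so scale matters.
pc_cov = first_component(var_income, cov_ia, var_age)

# Correlation-based PCA: equivalent to standardizing both attributes first.
r = cov_ia / math.sqrt(var_income * var_age)
pc_corr = first_component(1.0, r, 1.0)

print("covariance loadings (income, age): ", pc_cov)
print("correlation loadings (income, age):", pc_corr)
```

With the covariance matrix, the first component points almost entirely along `income`, simply because its values are larger. With the correlation matrix, both attributes get the same weight.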

    And just to confirm your suspicions, in Weka, centerData does the following:

    centerData=TRUE

    • Centers your data (does not normalize or standardize, so if you decide to do that, you need to do it before running PCA)
    • PCA is performed with the covariance matrix

    centerData=FALSE

    • PCA is performed with the correlation matrix (data is standardized by the method)
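The relationship between the two settings can be checked with a small sketch in plain Python. This is not Weka's implementation, and the columns `x` and `y` and all helper names are made up for illustration. It shows that the correlation of two columns equals the covariance of their standardized versions, and that a [0,1] min-max normalization leaves the correlation unchanged while it changes the covariance. If that holds for your data, normalizing first should only change the result when centerData=TRUE.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance of two equally long columns."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def corr(xs, ys):
    """Pearson correlation of two columns."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

def standardize(xs):
    """Shift to mean 0 and scale to unit sample standard deviation."""
    m, s = mean(xs), math.sqrt(cov(xs, xs))
    return [(x - m) / s for x in xs]

def min_max(xs):
    """Rescale a column linearly into the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Made-up columns on different scales.
x = [120.0, 150.0, 170.0, 200.0, 260.0]
y = [3.1, 2.9, 4.0, 4.4, 5.2]

# Correlation of the raw data equals covariance of the standardized data.
corr_raw = corr(x, y)
cov_of_standardized = cov(standardize(x), standardize(y))

# Min-max normalization is a linear rescaling per column:
# it changes the covariance but not the correlation.
cov_raw = cov(x, y)
cov_normalized = cov(min_max(x), min_max(y))
corr_normalized = corr(min_max(x), min_max(y))

print("correlation raw / normalized:", corr_raw, corr_normalized)
print("covariance  raw / normalized:", cov_raw, cov_normalized)
print("covariance of standardized  :", cov_of_standardized)
```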