How to use PCA on test set (code)


I'm trying to use PCA to select K principal components to work with.

I understand that one should NOT re-run PCA on the test set, but instead reuse the eigenvectors (PCs) found when modeling the training set.

I have 2 CSV files: one is the training set, the other is a test set (without the label for each record).

PCA process on the training set is done with the following code:

# Load CSV file
train_set.init_data <- read.csv("D:\\train.csv", header = TRUE)

# Remove identifier and response variables (Id and SalePrice):
train_set.vars <- subset(train_set.init_data, select = -c(Id, SalePrice))

# Convert categorical variables into numerical ones using dummy variables:
library(dummies)
train_set.vars_dummy <- dummy.data.frame(train_set.vars, sep = ".")

# Principal Component Analysis (variables are centered and scaled to unit variance):
train_set.prin_comp <- prcomp(train_set.vars_dummy, scale. = TRUE)

# Choose some K components
????

# Run linear regression model based on PC's
<.....>

After I'm done building a model using the training set, I would need to load the testing set and run my prediction model on it.

The difficulties I'm having, in terms of 'How to code it?':

  1. How do I extract K PCs (K will be chosen from a scree plot) after running PCA on the training set, so that the training model (I'm planning on linear regression) is based on them?

  2. How do I use those K extracted PCs when running the fitted model on the actual test set?

  3. Should I zero-mean the features in the test set first, or scale them by their standard deviation? For the training set, I understand prcomp already does that for me, so I'm not sure whether I should do it manually on the test set.

  4. Should I transform the categorical variables of the test set into numerical ones using dummy variables, as I've done with the training set?

I do understand the basic principle: the same operations applied to the training set should be applied to the test set as well.

But I'm not sure exactly what that means in terms of code.

Thanks


Solution

  • I'm using the USArrests dataset to give you an idea of the sequence of steps to follow when applying PCA to test data.

    library(dplyr)
    library(tibble)
    library(broom)  # provides augment(), used in Approach 1
    data(USArrests)
    # For illustration only, train and test are both copies of USArrests
    train <- USArrests %>% rownames_to_column(var = "rowname")
    test <- USArrests %>% rownames_to_column(var = "rowname")
    

    Approach 1 - Combined train & test

    # Join train and test set, tagging each row with its origin
    df <- bind_rows("train" = train, "test" = test, .id = "group")
    # Run Principal Components Analysis on the combined data
    pc <- prcomp(df %>% select(-rowname, -group), scale. = TRUE)
    # Scree plot: proportion of variance explained by each component
    pc_var <- (pc$sdev^2) / sum(pc$sdev^2)
    plot(pc_var, xlab = "Principal Component", ylab = "Proportion of Variance Explained", type = "b")
    # Extract PCs (e.g. the first 3 PCs)
    df <- augment(pc, df) %>% select(group, rowname, .fittedPC1:.fittedPC3)
    # Split back into train and test
    train <- df %>% filter(group == "train") %>% select(-group)
    test <- df %>% filter(group == "test") %>% select(-group)
    

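    In this approach information from the test data leaks into the training data, because the centering, the scaling and the loadings are all estimated from both sets together.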
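
To make the leakage concrete, here is a minimal sketch. It assumes a hypothetical random 35/15 split of USArrests; the split, the seed and the names `idx`, `train_x` and `test_x` are illustrative and not part of the answer above. It compares the centering and scaling values that prcomp() stores when it is fitted on the training rows only with those from the combined rows.

```r
data(USArrests)

# Hypothetical split (illustrative only): 35 training rows, 15 test rows
set.seed(1)
idx     <- sample(nrow(USArrests), 35)
train_x <- USArrests[idx, ]
test_x  <- USArrests[-idx, ]

# PCA fitted on the training rows only vs. on training + test rows
pc_train    <- prcomp(train_x, scale. = TRUE)
pc_combined <- prcomp(rbind(train_x, test_x), scale. = TRUE)

# The stored centering and scaling values differ, which shows that the
# combined fit has already used information from the test rows
print(rbind(train_only = pc_train$center, combined = pc_combined$center))
print(rbind(train_only = pc_train$scale,  combined = pc_combined$scale))
```

Any model built on the combined scores has therefore seen summary statistics of the test rows before it is evaluated on them.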

    Approach 2 - Using predict() to transform test data from PCA loadings of train data

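    # Assumes train and test as created in the setup block above
    # (i.e. before Approach 1 overwrote them)
    # Run Principal Components Analysis on the training data only
    pc <- prcomp(train %>% select(-rowname), scale. = TRUE)
    # Extract PCs (e.g. the first 3 PCs)
    train <- as_tibble(pc$x) %>% select(PC1:PC3)
    # Project the test data onto the training loadings
    test <- as_tibble(predict(pc, newdata = test %>% select(-rowname))) %>% select(PC1:PC3)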
    

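    This approach is more robust than the earlier one: predict() centers and scales the test data with the means and standard deviations stored from the training fit and then applies the training loadings, so the test set needs no manual standardization and never influences the PCA.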
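
Building on Approach 2, the sketch below follows the remaining steps from the question: choose K, fit a linear regression on the K training scores, then score the test set. It is illustrative only and rests on assumptions that are not in the answer above: a hypothetical 35/15 split of USArrests, `Murder` standing in for the response, and a 90% cumulative-variance cutoff for K in place of reading a scree plot.

```r
data(USArrests)

# Hypothetical split (illustrative only): 35 training rows, 15 test rows
set.seed(42)
idx   <- sample(nrow(USArrests), 35)
train <- USArrests[idx, ]
test  <- USArrests[-idx, ]

# Assumption: Murder plays the role of the response; the rest are predictors
predictors <- setdiff(names(USArrests), "Murder")

# Fit PCA on the training predictors only
pc <- prcomp(train[, predictors], scale. = TRUE)

# Choose K: smallest number of PCs explaining at least 90% of the variance
# (assumed cutoff; a scree plot could be used instead)
pc_var <- pc$sdev^2 / sum(pc$sdev^2)
K <- which(cumsum(pc_var) >= 0.90)[1]

# Training scores for the first K components, plus the response
train_pcs <- data.frame(pc$x[, 1:K, drop = FALSE], Murder = train$Murder)
fit <- lm(Murder ~ ., data = train_pcs)

# Project the test predictors with the training center, scale and rotation
test_scores <- predict(pc, newdata = test[, predictors])
test_pcs    <- data.frame(test_scores[, 1:K, drop = FALSE])
pred        <- predict(fit, newdata = test_pcs)

# predict() on a prcomp object matches standardizing the test data with the
# training statistics and multiplying by the rotation matrix
manual <- scale(as.matrix(test[, predictors]),
                center = pc$center, scale = pc$scale) %*% pc$rotation
stopifnot(isTRUE(all.equal(unname(manual), unname(test_scores))))

print(head(data.frame(actual = test$Murder, predicted = pred)))
```

With real data, the test set would first go through the same preprocessing as the training set (for example, the same dummy-variable columns with the same names), since predict() matches the columns of `newdata` by name against those used in the training fit.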