I'm trying to run a PCA on the "training1" data set created below:
library(AppliedPredictiveModeling); data(AlzheimerDisease); library(caret)
adData <- data.frame(diagnosis, predictors)
inTrain <- createDataPartition(y = adData$diagnosis, p = .75)[[1]]
training <- adData[inTrain, ]
keep <- subset(data.frame(x = substr(as.character(colnames(training)), 1, 2), y = c(1:ncol(training))), x == "IL")
training1 <- cbind(training[, c(keep[1, 2]:keep[nrow(keep), 2])], training[c("diagnosis")])
Then, when I run the following function:
preProc <- preProcess(log10(training1[, -13]+1), method = "pca", pcaComp = 2)
I get the following error:
Warning in preProcess.default(log10(training1[, -13] + 1), method = "pca", :
Std. deviations could not be computed for: IL_1alpha, IL_3
Error in prcomp.default(x[, method$pca, drop = FALSE], scale = TRUE, retx = FALSE) :
cannot rescale a constant/zero column to unit variance
However, I then ran the following two commands to show that standard deviations can be calculated for the two variables the warning names:
sd(training1$IL_1alpha)
[1] 0.4056147
sd(training1$IL_3)
[1] 0.5235212
I then ran the following to confirm that none of my variables has zero variance:
nsv <- nearZeroVar(training1, saveMetrics = TRUE)
> print(nsv)
              freqRatio percentUnique zeroVar   nzv
IL_11          1.250000    29.4820717   FALSE FALSE
IL_13          1.052632     6.7729084   FALSE FALSE
IL_16          1.117647    21.9123506   FALSE FALSE
IL_17E         1.238095    16.7330677   FALSE FALSE
IL_1alpha      1.208333    23.1075697   FALSE FALSE
IL_3           1.066667    24.7011952   FALSE FALSE
IL_4           1.315789    19.1235060   FALSE FALSE
IL_5           1.000000    19.5219124   FALSE FALSE
IL_6           1.000000    20.3187251   FALSE FALSE
IL_6_Receptor  1.041667    21.5139442   FALSE FALSE
IL_7           1.611111    18.7250996   FALSE FALSE
IL_8           1.000000    22.3107570   FALSE FALSE
diagnosis      2.637681     0.7968127   FALSE FALSE
It seems like other people's issues with PCA in R were around zero variance columns, but since I can prove that I don't have that issue here, any ideas what may be causing the issue?
Sorry, I don't have the rep to comment, so I'm posting this as an answer. After running your code, this line in particular:
log10(training1[, -13]+1)
returns NaN values in some columns (IL_1alpha and IL_3, in fact):
Warning messages:
1: In lapply(X = x, FUN = .Generic, ...) : NaNs produced
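So that seems to be the source of the error. log10(x + 1) is only defined for x > -1, so any value below -1 in those columns becomes NaN, and the standard deviation of the transformed column can no longer be computed, even though sd() on the raw column works fine. Maybe you shouldn't take logs of negative numbers, and should consider a different transformation instead (or whether one is necessary at all)?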
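To make that concrete, here is a minimal base-R sketch of how one might locate the offending columns and try a different transformation. The data frame `toy` and its values are made up to stand in for the IL_* predictors, and the signed log at the end is only one possible alternative that I am assuming for illustration, not something caret requires or recommends.

```r
# Toy data standing in for the IL_* predictors (hypothetical values).
toy <- data.frame(
  a = c(0.5, 1.2, 2.0, 3.1),   # every value > -1, so log10(a + 1) is defined
  b = c(-1.5, 0.3, 1.1, 2.4)   # one value < -1, so log10(b + 1) yields NaN
)

# Same transformation as in the question; the NaN warning is silenced here.
logged <- suppressWarnings(log10(toy + 1))

# Count missing (NaN) values per transformed column.
nan_counts <- colSums(is.na(logged))
print(nan_counts)

# sd() is NA for any column containing NaN, even though the raw sd is fine.
sds_raw    <- sapply(toy, sd)
sds_logged <- sapply(logged, sd)
print(sds_raw)
print(sds_logged)

# Raw columns with values at or below -1 cannot be safely passed to log10(x + 1)
# (exactly -1 gives -Inf, anything lower gives NaN).
bad_cols <- names(toy)[sapply(toy, function(col) any(col <= -1))]
print(bad_cols)

# One possible alternative (an assumption, not from the original post):
# a signed log that is defined for negative values as well.
signed <- as.data.frame(lapply(toy, function(col) sign(col) * log10(abs(col) + 1)))
print(sapply(signed, sd))
```

On the real data, the same checks would be run on training1[, -13] in place of toy. Whether a signed log, no transformation at all, or something else is appropriate depends on what the predictors mean, so treat the last step as a starting point only.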