I have a data set called Data
, with 30 scaled and centered features and 1 outcome with column name OUTCOME
, referred to 700k records, stored in data.table
format. I computed its PCA, and observed that its first 8 components account for the 95% of the variance. I want to train a random forest in h2o
, so this is what I do:
Data.pca=prcomp(Data,retx=TRUE) # compute the PCA of Data
Data.rotated=as.data.table(Data.pca$x)[,c(1:8)] # keep only first 8 components
Data.dump=cbind(Data.rotated,subset(Data,select=c(OUTCOME))) # PCA dataset plus outcomes for training
This way I have a dataset Data.dump
where I have 8 features that are rotated on the PCA components, and at each record I associated its outcome.
First question: is this rational? or do I have to permute somehow the outcomes vector? or the two things are unrelated?
Then I split Data.dump
in two sets, Data.train
for training and Data.test
for testing, all as.h2o
. The I feed them to a random forest:
rf=h2o.randomForest(training_frame=Data.train,x=1:8,y=9,stopping_rounds=2,
ntrees=200,score_each_iteration=T,seed=1000000)
rf.pred=as.data.table(h2o.predict(rf,Data.test))
What happens is that rf.pred
seems not so similar to the original outcomes Data.test$OUTCOME
. I tried to train a neural network as well, and did not even converge, crashing R.
Second question: is it because I am carrying on some mistake from the PCA treatment? or because I badly set up the random forest? Or I am just dealing with annoying data?
I do not know where to start, as I am new to data science, but the workflow seems correct to me.
Thanks a lot in advance.
The answer to your second question (i.e. "is it the data, or did I do something wrong") is hard to know. This is why you should always try to make a baseline model first, so you have an idea of how learnable the data is.
The baseline could be h2o.glm()
, and/or it could be h2o.randomForest()
, but either way without the PCA step. (You didn't say if you are doing a regression or a classification, i.e. if OUTCOME
is a number or a factor, but both glm and random forest will work either way.)
Going to your first question: yes, it is a reasonable thing to do, and no you don't have to (in fact, should not) involve the outcomes vector.
Another way to answer your first question is: no, it unreasonable. It may be that a random forest can see all the relations itself without needing you to use a PCA. Remember when you use a PCA to reduce the number of input dimensions you are also throwing away a bit of signal, too. You said that the 8 components only capture 95% of the variance. So you are throwing away some signal in return for having fewer inputs, which means you are optimizing for complexity at the expense of prediction quality.
By the way, concatenating the original inputs and your 8 PCA components, is another approach: you might get a better model by giving it this hint about the data. (But you might not, which is why getting some baseline models first is essential, before trying these more exotic ideas.)