I want to use PCA in H2O. In sklearn, we can call fit on the training set and then apply transform to the test set. I am trying to follow the same logic in H2O. The FAQ says:
After the PCA model has been built using h2o.prcomp, use h2o.predict on the original data frame and the PCA model to produce the dimensionality-reduced representation. Use cbind to add the predictor column from the original data frame to the data frame produced by the output of h2o.predict. At this point, you can build supervised learning models on the new data frame.
Based on this I tried the below:
from h2o.transforms.decomposition import H2OPCA
trbb_pca = H2OPCA(k=5, transform="NORMALIZE", pca_method="GramSVD",
                  use_all_factor_levels=True, impute_missing=True, seed=24)
trbb_pca.train(x=trbb_cols, training_frame=train_h2o)
train_h2o_pca = train_h2o.cbind(trbb_pca.predict(train_h2o))
test_h2o_pca = test_h2o.cbind(trbb_pca.predict(test_h2o))
Is this the right way to apply PCA to the train and test sets in H2O?
Short answer: Yes. There is an example in the H2O Python booklet, copied here for clarity:
In [25]: from h2o.transforms.decomposition import H2OPCA
In [26]: pca_decomp = H2OPCA(k=2, transform="NONE", pca_method="Power")
In [27]: pca_decomp.train(x=range(0,4), training_frame=iris_df)
pca Model Build Progress: [#######################################] 100%
In [28]: pca_decomp
Out[28]: Model Details
=============
H2OPCA : Principal Component Analysis
Model Key: PCA_model_python_1446220160417_10
Importance of components:
pc1 pc2
---------------------- ------- --------
Standard deviation 7.86058 1.45192
Proportion of Variance 0.96543 0.032938
Cumulative Proportion 0.96543 0.998368
ModelMetricsPCA: pca
**
Reported on train data.
**
MSE: NaN
RMSE: NaN
In [29]: pred = pca_decomp.predict(iris_df)
pca prediction progress: [#######################################] 100%
In [30]: pred.head() # Projection results
Out[30]:
PC1 PC2
------- -------
5.9122 2.30344
5.57208 1.97383
5.44648 2.09653
5.43602 1.87168
5.87507 2.32935
6.47699 2.32553
5.51543 2.07156
5.85042 2.14948
5.15851 1.77643
5.64458 1.99191
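The booklet example projects the same frame the model was trained on, whereas the question is about a separate test frame. Conceptually, the predict call plays the role of sklearn's transform: the column means and components come from the training rows only and are then reused on new rows. The following is a minimal pure-Python sketch of that idea, not H2O's implementation. The names fit_pca_1d and transform and the toy data are hypothetical, and only one component is computed (by power iteration) to keep it short.

```python
import math

def fit_pca_1d(rows, iters=200):
    """Fit a one-component PCA on training rows only.

    Returns the training column means and the first principal direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered training data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the
    # eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def transform(rows, means, component):
    """Project rows using statistics learned from the training set."""
    d = len(means)
    return [sum((r[j] - means[j]) * component[j] for j in range(d))
            for r in rows]

# Hypothetical toy data: two correlated columns.
train = [[2.0, 1.9], [0.0, 0.1], [1.0, 1.1], [3.0, 2.9]]
test = [[4.0, 4.0], [1.5, 1.5]]

means, pc1 = fit_pca_1d(train)                # "fit" on train only
train_scores = transform(train, means, pc1)   # project train
test_scores = transform(test, means, pc1)     # reuse train means and direction
```

Note that the test rows are centered with the training means, not their own. That is the point of fitting once and projecting both sets with the same model.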
There are technically two ways to use the PCA estimator in Python. The old method is located at h2o.transforms.decomposition.H2OPCA. A few years ago, we rewrote the Python API and moved some things around, including turning PCA into a proper "H2OEstimator", so it is now also located at h2o.estimators.pca.H2OPrincipalComponentAnalysisEstimator. Both methods work, though for new code we recommend the new one because it is consistent with the other H2O estimators.
The API is the same, so switching is optional. If you want to use the new one, change your import statement from:
from h2o.transforms.decomposition import H2OPCA
to:
from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator as H2OPCA
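Putting the pieces together, here is a sketch of the train/test workflow from the question using the newer import. It assumes a running H2O cluster (h2o.init()) and two H2OFrames already loaded. The function name pca_project and the response_col argument are hypothetical. Following the FAQ, it binds only the response column onto the projected frames rather than all of the original columns.

```python
def pca_project(train_frame, test_frame, feature_cols, response_col, k=5):
    """Fit PCA on train_frame only, then project both frames with that model.

    train_frame and test_frame are assumed to be H2OFrames;
    response_col is a hypothetical name for the target column.
    """
    # Imported here so this sketch can be loaded without h2o installed.
    from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator

    # Same settings as in the question.
    pca = H2OPrincipalComponentAnalysisEstimator(
        k=k, transform="NORMALIZE", pca_method="GramSVD",
        use_all_factor_levels=True, impute_missing=True, seed=24,
    )
    # The model sees only the training frame.
    pca.train(x=feature_cols, training_frame=train_frame)

    # predict() returns the principal component scores; cbind the response
    # column back on so a supervised model can be trained on the result.
    train_pca = pca.predict(train_frame).cbind(train_frame[response_col])
    test_pca = pca.predict(test_frame).cbind(test_frame[response_col])
    return pca, train_pca, test_pca
```

With the names from the question, a call might look like pca_project(train_h2o, test_h2o, trbb_cols, "target"), where "target" stands in for whatever the response column is actually called.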