How to use fit and transform for PCA on H2O

I want to use PCA in H2O. In sklearn, we can call fit on the training set and then apply transform to the test set. I am trying to follow the same logic in H2O. The FAQ says:

After the PCA model has been built using h2o.prcomp, use h2o.predict on the original data frame and the PCA model to produce the dimensionality-reduced representation. Use cbind to add the predictor column from the original data frame to the data frame produced by the output of h2o.predict. At this point, you can build supervised learning models on the new data frame.

Based on this, I tried the following:

from h2o.transforms.decomposition import H2OPCA

trbb_pca = H2OPCA(k=5, transform="NORMALIZE", pca_method="GramSVD",
                  use_all_factor_levels=True, impute_missing=True, seed=24)

trbb_pca.train(x=trbb_cols, training_frame=train_h2o)

train_h2o_pca = train_h2o.cbind(trbb_pca.predict(train_h2o))
test_h2o_pca = test_h2o.cbind(trbb_pca.predict(test_h2o))

Is this the right way to implement PCA on the train and test sets in H2O?

Solution

Short answer: Yes. There is an example in the H2O Python booklet, copied here for clarity:

In [25]: from h2o.transforms.decomposition import H2OPCA

In [26]: pca_decomp = H2OPCA(k=2, transform="NONE", pca_method="Power")

In [27]: pca_decomp.train(x=range(0,4), training_frame=iris_df)

pca Model Build Progress: [#######################################] 100%

In [28]: pca_decomp
Out[28]: Model Details
=============
H2OPCA :  Principal Component Analysis
Model Key:  PCA_model_python_1446220160417_10

Importance of components:
                        pc1      pc2
----------------------  -------  --------
Standard deviation      7.86058  1.45192
Proportion of Variance  0.96543  0.032938
Cumulative Proportion   0.96543  0.998368

ModelMetricsPCA: pca

** Reported on train data. **
MSE: NaN
RMSE: NaN

In [29]: pred = pca_decomp.predict(iris_df)

pca prediction progress: [#######################################] 100%

In [30]: pred.head() # Projection results
Out[30]:
    PC1      PC2
-------  -------
5.9122   2.30344
5.57208  1.97383
5.44648  2.09653
5.43602  1.87168
5.87507  2.32935
6.47699  2.32553
5.51543  2.07156
5.85042  2.14948
5.15851  1.77643
5.64458  1.99191
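
To make the fit/transform analogy from the question concrete, here is a minimal NumPy sketch of the idea. It is not H2O's implementation: the helper names `pca_fit` and `pca_transform` and the synthetic data are made up for illustration, and it only centers the data, whereas H2O's `transform` option (for example "NORMALIZE") may also rescale it. It assumes that `predict` reuses whatever was learned during `train`, which is how the booklet example above reads.

```python
import numpy as np


def pca_fit(X_train, k):
    """Learn the centering vector and top-k principal directions from training data only."""
    mean = X_train.mean(axis=0)
    # SVD of the centered training matrix; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k].T  # components has shape (n_features, k)


def pca_transform(X, mean, components):
    """Project any data (train or test) using the statistics learned in pca_fit."""
    return (X - mean) @ components


# Hypothetical synthetic data: 100 rows, 4 correlated numeric features.
rng = np.random.default_rng(24)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X_train, X_test = X[:80], X[80:]

# "fit" step, analogous to train(): only the training rows are used.
mean, components = pca_fit(X_train, k=2)

# "transform" step, analogous to predict(): the same mean and components
# are applied to both sets, so the test rows never influence the projection.
train_scores = pca_transform(X_train, mean, components)
test_scores = pca_transform(X_test, mean, components)
```

The point of the sketch is that the test set is projected with the mean and directions learned from the training set, which is the behaviour you get in the question's code by calling `predict` on `test_h2o` with a model trained on `train_h2o`.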

There are technically two ways to use the PCA estimator in Python. The older one is h2o.transforms.decomposition.H2OPCA. A few years ago, we rewrote the Python API and moved some things around, including turning PCA into a proper "H2OEstimator", so it is now also available as h2o.estimators.pca.H2OPrincipalComponentAnalysisEstimator. Both work, but for new code we recommend the newer one because it is consistent with the other H2O estimators.

The API is the same, so although it is not necessary, you can switch to the new one by changing your import statement from:

from h2o.transforms.decomposition import H2OPCA

to:

from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator as H2OPCA