Tags: python-3.x, pca, h2o

How to use fit and transform for PCA on H2O


I want to use PCA in H2O. In sklearn, we can call fit on the training set and then apply transform to the test set. I am trying to follow the same logic in H2O. The FAQ says:

After the PCA model has been built using h2o.prcomp, use h2o.predict on the original data frame and the PCA model to produce the dimensionality-reduced representation. Use cbind to add the predictor column from the original data frame to the data frame produced by the output of h2o.predict. At this point, you can build supervised learning models on the new data frame.

Based on this, I tried the following:

from h2o.transforms.decomposition import H2OPCA

trbb_pca = H2OPCA(k=5, transform="NORMALIZE", pca_method="GramSVD",
                  use_all_factor_levels=True, impute_missing=True, seed=24)

# Learn the principal components from the training frame only
trbb_pca.train(x=trbb_cols, training_frame=train_h2o)

# Project each frame onto those components and append the scores as new columns
train_h2o_pca = train_h2o.cbind(trbb_pca.predict(train_h2o))
test_h2o_pca = test_h2o.cbind(trbb_pca.predict(test_h2o))

Is this the right way to apply PCA to the train and test sets in H2O?


Solution

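  • Short answer: Yes. train() plays the role of sklearn's fit and predict() the role of transform, so training on the train set and then calling predict on both frames is the right pattern. There is an example in the H2O Python booklet, copied here for clarity: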

    In [25]: from h2o.transforms.decomposition import H2OPCA
    
    In [26]: pca_decomp = H2OPCA(k=2, transform="NONE", pca_method="Power")
    
    In [27]: pca_decomp.train(x=range(0,4), training_frame=iris_df)
    
    pca Model Build Progress: [#######################################] 100%
    
    In [28]: pca_decomp
    Out[28]: Model Details
    =============
    H2OPCA :  Principal Component Analysis
    Model Key:  PCA_model_python_1446220160417_10
    
    Importance of components:
                            pc1      pc2
    ----------------------  -------  --------
    Standard deviation      7.86058  1.45192
    Proportion of Variance  0.96543  0.032938
    Cumulative Proportion   0.96543  0.998368
    
    ModelMetricsPCA: pca
    
    **
    Reported on train data.
    **
    MSE: NaN
    RMSE: NaN
    
    In [29]: pred = pca_decomp.predict(iris_df)
    
    pca prediction progress: [#######################################] 100%
    
    In [30]: pred.head() # Projection results
    Out[30]:
        PC1      PC2
    -------  -------
    5.9122   2.30344
    5.57208  1.97383
    5.44648  2.09653
    5.43602  1.87168
    5.87507  2.32935
    6.47699  2.32553
    5.51543  2.07156
    5.85042  2.14948
    5.15851  1.77643
    5.64458  1.99191
    

    There are technically two ways to use the PCA estimator in Python. The older one is h2o.transforms.decomposition.H2OPCA. A few years ago, we rewrote the Python API and moved some things around, including turning PCA into a proper "H2OEstimator", so it is now also available as h2o.estimators.pca.H2OPrincipalComponentAnalysisEstimator. Both work, but for new code we recommend the newer one because it is consistent with the other H2O estimators.

    The API is the same, so although it is not necessary, you can switch to the new one by changing your import statement from:

    from h2o.transforms.decomposition import H2OPCA
    

    to:

    from h2o.estimators.pca import H2OPrincipalComponentAnalysisEstimator as H2OPCA
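
To make the fit/transform analogy from the question concrete, here is a minimal NumPy sketch of what "train on the train set, predict on both sets" amounts to. It is an illustration only and not H2O's implementation: the helper names fit_pca and transform_pca are hypothetical, the data is random toy data, and the sketch only centers the columns, whereas H2O would also apply whatever rescaling the transform option requests.

```python
import numpy as np


def fit_pca(train, k):
    """Learn the centering vector and top-k principal axes from the training rows only."""
    mean = train.mean(axis=0)
    centered = train - mean
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]


def transform_pca(data, mean, components):
    """Project rows onto the stored axes, reusing the mean learned at fit time."""
    return (data - mean) @ components.T


# Toy data (an assumption for illustration): 120 rows, 4 correlated features.
rng = np.random.default_rng(24)
data = rng.normal(size=(120, 4)) @ rng.normal(size=(4, 4))
train, test = data[:90], data[90:]

# "train": statistics come from the training rows only.
mean, components = fit_pca(train, k=2)

# "predict": the same stored mean and axes are applied to both sets.
train_pcs = transform_pca(train, mean, components)
test_pcs = transform_pca(test, mean, components)

# Append the component scores to the original columns (the role cbind plays above).
train_aug = np.hstack([train, train_pcs])
test_aug = np.hstack([test, test_pcs])
print(train_aug.shape, test_aug.shape)
```

The point of the sketch is that the test rows never influence the mean or the axes; they are only projected. That is the same separation the train()/predict() calls in the question rely on.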