I am running the simplest PCA example I can find, using the iris
data frame, but I keep getting the same error when I try to convert the PCA components to a matrix:
> iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
> pca_model <- tbl(sc, "iris") %>%
+ select(-Species) %>%
+ ml_pca()
> print(pca_model)
Explained variance:
PC1 PC2 PC3 PC4
0.924618723 0.053066483 0.017102610 0.005212184
Rotation:
PC1 PC2 PC3 PC4
Sepal_Length -0.36138659 -0.65658877 0.58202985 0.3154872
Sepal_Width 0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal_Length -0.85667061 0.17337266 -0.07623608 -0.4798390
Petal_Width -0.35828920 0.07548102 -0.54583143 0.7536574
> D <- as.matrix(iris[1:4])
> E <- as.matrix(pca_model$components)
Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), :
'data' must be of a vector type, was 'NULL'
Can someone point out where the mistake is? I can't figure it out. Thank you
The short-ish answer to your problem is that `ml_pca` returns a model object, not a results object (these are not strictly official terms). You can see this by inspecting `pca_model` (e.g. `str(pca_model)`). Think of `pca_model` as being more like the return value of `lm` than of `prcomp`. Basically, you need to use the model to 'predict' on the same data you trained on to get your desired output. (I put 'predict' in quotes rather than backticks because it turns out you cannot use `ml_predict` in this case; I am not sure why.) For `ml_pca` models there are two convenient wrapper functions, `tidy` and `augment`, that will get you where you need to go. NB: How we are supposed to know that `augment` means predict and `tidy` means gather components is beyond me.
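To make the error itself less mysterious, here is a small base R sketch that needs no Spark connection. `fake_model` is a hypothetical stand-in list (its element names are made up for illustration and are not sparklyr's); it only mimics asking a model object for an element it does not have. The second half illustrates the `lm` versus `prcomp` comparison on the local `iris` data.

```r
# A hypothetical stand-in for a model object: a plain list that has
# no element called `components`.
fake_model <- list(loadings_matrix = diag(2), explained_variance = c(0.9, 0.1))

# `$` on a list quietly returns NULL for a name that does not exist.
missing_piece <- fake_model$components
is.null(missing_piece)

# The failure only shows up once as.matrix() receives that NULL.
err <- tryCatch(as.matrix(missing_piece),
                error = function(e) conditionMessage(e))
err

# Listing the names shows which elements are really there.
names(fake_model)

# prcomp() hands back results directly: the scores already sit in pc$x.
pc <- prcomp(iris[1:4])
dim(pc$x)

# lm() hands back a model: results require a separate predict() step.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)
preds <- predict(fit, newdata = iris)
length(preds)
```

Running `names(pca_model)` in your own session is probably the quickest way to see which elements your sparklyr version actually exposes.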
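I wasn't sure whether you wanted the components (i.e. loadings) or the rotated data (i.e. the scores for each row), so I am giving you both: `loadings` comes from `tidy` and `rot` comes from `augment`.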
install.packages("Rcpp")
install.packages("sparklyr")
library(sparklyr)
library(dplyr)
sc <- spark_connect(method = "databricks") ## change this for your cluster/Spark deployment
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
pca_model <- tbl(sc, "iris") %>%
  select(-Species) %>%
  ml_pca()
print(pca_model)
# Explained variance:
#
# PC1 PC2 PC3 PC4
# 0.924618723 0.053066483 0.017102610 0.005212184
#
# Rotation:
# PC1 PC2 PC3 PC4
# Sepal_Length -0.36138659 -0.65658877 0.58202985 0.3154872
# Sepal_Width 0.08452251 -0.73016143 -0.59791083 -0.3197231
# Petal_Length -0.85667061 0.17337266 -0.07623608 -0.4798390
# Petal_Width -0.35828920 0.07548102 -0.54583143 0.7536574
class(pca_model)
#[1] "ml_model_pca" "ml_model"
str(pca_model)
#List of 8
# $ pipeline_model :List of 5
# ..$ uid : chr "pipeline_9bc1b484009"
# ..$ param_map : Named list()
# ..$ stages :List of 2
# .. ..$ :List of 3
# .. .. ..$ uid : chr "vector_assembler_9bc188edeed"
# .. .. ..$ param_map:List of 3
# .. .. .. ..$ input_cols :List of 4
# .. .. .. .. ..$ : chr "Sepal_Length"
# .. .. .. .. ..$ : chr "Sepal_Width"
# .. .. .. .. ..$ : chr "Petal_Length"
# .. .. .. .. ..$ : chr "Petal_Width"
# .. .. .. ..$ output_col : chr "assembled9bc3ab7e7e1"
# .. .. .. ..$ handle_invalid: chr "error"
# .. .. ..$ .jobj :Classes 'spark_jobj', 'shell_jobj'
# .. .. ..- attr(*, "class")= chr [1:3] "ml_vector_assembler" "ml_transformer" "ml_pipeline_stage"
# .. ..$ :List of 5
# .. .. ..$ uid : chr "pca_9bc60d84696"
loadings <- tidy(pca_model)
loadings
# A tibble: 4 x 5
# features PC1 PC2 PC3 PC4
#
#1 Sepal_Length -0.361 -0.657 0.582 0.315
#2 Sepal_Width 0.0845 -0.730 -0.598 -0.320
#3 Petal_Length -0.857 0.173 -0.0762 -0.480
#4 Petal_Width -0.358 0.0755 -0.546 0.754
rot <- augment(pca_model, iris_tbl) %>% collect() # augment predicts given a model and "new" data
rot
# A tibble: 150 x 9
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species PC1 PC2 PC3
#
# 1 5.1 3.5 1.4 0.2 setosa -2.82 -5.65 0.660
# 2 4.9 3 1.4 0.2 setosa -2.79 -5.15 0.842
# 3 4.7 3.2 1.3 0.2 setosa -2.61 -5.18 0.614
# 4 4.6 3.1 1.5 0.2 setosa -2.76 -5.01 0.600
# 5 5 3.6 1.4 0.2 setosa -2.77 -5.65 0.542
# 6 5.4 3.9 1.7 0.4 setosa -3.22 -6.07 0.463
# 7 4.6 3.4 1.4 0.3 setosa -2.68 -5.24 0.374
# 8 5 3.4 1.5 0.2 setosa -2.88 -5.49 0.654
# 9 4.4 2.9 1.4 0.2 setosa -2.62 -4.75 0.611
#10 4.9 3.1 1.5 0.1 setosa -2.83 -5.21 0.829
# ... with 140 more rows, and 1 more variable: PC4
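If the point of building `D` and `E` was to multiply them yourself, the sketch below does that locally, without Spark. It assumes the rotation matrix is typed in from the `print(pca_model)` output above, and it projects the raw (uncentered) features, since that is what reproduces the `augment` rows shown here. Treat it as a cross-check under those assumptions rather than a description of what Spark does internally.

```r
# Feature matrix from the local iris data. The column names use dots
# instead of Spark's underscores, but the column order is the same.
D <- as.matrix(iris[1:4])

# Rotation matrix copied from the print(pca_model) output.
E <- matrix(
  c(-0.36138659, -0.65658877,  0.58202985,  0.3154872,
     0.08452251, -0.73016143, -0.59791083, -0.3197231,
    -0.85667061,  0.17337266, -0.07623608, -0.4798390,
    -0.35828920,  0.07548102, -0.54583143,  0.7536574),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    paste0("PC", 1:4)
  )
)

# Project each row onto the principal components (no centering applied).
scores <- D %*% E

# Compare the first rows with PC1..PC3 in the augment() output.
round(head(scores, 3), 2)
```

In a live session you could probably build `E` from the `tidy` result instead of typing it in, for example with `as.matrix(loadings[, -1])`, which drops the `features` column and keeps the numeric ones.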