AWS EC2 Spark / Hadoop cluster.
The following basic k-means sparklyr code worked with Spark 2.0.1:
kmeans_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3)
I've upgraded to Spark 2.1.1, and now I get this error:
Error: java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
...
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
I've tried some variations, such as:
kmeans_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(k = 3, features = c("Petal_Length", "Petal_Width"))
or
kmeans_model <- iris_tbl %>%
  dplyr::select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, features = c("Petal_Length", "Petal_Width"))
but I still get the same error.
This code wouldn't have worked in Spark 2.0 either; it is incorrect independent of the Spark version. By default, ml_kmeans (and the other ml_* functions) expects a Vector-type column named features. The features argument can be used to override that column name, and it should be a length-one character vector, not a vector of input column names.

The only way to make this work without using ft_vector_assembler is to provide a formula:
kmeans_model <- iris_tbl %>%
  ml_kmeans(formula = ~ Petal_Width + Petal_Length, k = 3)
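For completeness, the ft_vector_assembler route would look roughly like this (a sketch, assuming the same iris_tbl from an active sparklyr connection; ft_vector_assembler's input_cols / output_col arguments combine the chosen columns into a single Vector column, which ml_kmeans then finds under its default features name):

```r
library(sparklyr)
library(dplyr)

# Assemble the two numeric columns into one Vector column named "features",
# then fit k-means on the assembled table.
kmeans_model <- iris_tbl %>%
  ft_vector_assembler(
    input_cols = c("Petal_Width", "Petal_Length"),
    output_col = "features"
  ) %>%
  ml_kmeans(k = 3)
```

Either approach (formula or explicit assembly) produces the Vector column the Spark ML estimator requires; the formula interface just does the assembly for you behind the scenes.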