apache-spark · k-means · apache-spark-mllib · sparklyr

sparklyr ml_kmeans Field "features" does not exist


AWS EC2 Spark / Hadoop cluster.

The following basic K-Means sparklyr code worked with Spark 2.0.1:

kmeans_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3)

I've upgraded to Spark 2.1.1, and now I get this error:

    Error: java.lang.IllegalArgumentException: Field "features" does not exist.
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
        ...
        io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)

I've tried some variations of the code:

kmeans_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(k = 3, features = c("Petal_Length", "Petal_Width"))

or

kmeans_model <- iris_tbl %>%
  dplyr::select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, features = c("Petal_Length", "Petal_Width"))

But still get the same error.


Solution

  • This code wouldn't have worked in Spark 2.0 either, just as it cannot work in more recent versions; the usage is incorrect regardless of Spark version. By default, ml_kmeans (like the other ml_* functions) expects a Vector-type column named features. The features argument can be used to override that column name, and it should be:

    a length-one character vector

    The only way to make this work without using ft_vector_assembler is to provide a formula:

    kmeans_model <- iris_tbl %>%
      ml_kmeans(formula = ~ Petal_Width + Petal_Length, k = 3)
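
    Alternatively, you can build the features column yourself with ft_vector_assembler before calling ml_kmeans. A minimal sketch, assuming a local Spark connection and that iris_tbl was created with copy_to (which renames columns like Petal.Width to Petal_Width):

    library(sparklyr)
    library(dplyr)

    # Assumed setup: a local Spark connection and the iris data copied in
    sc <- spark_connect(master = "local")
    iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

    # Assemble the two feature columns into a single Vector-type column
    # named "features", which ml_kmeans looks for by default
    kmeans_model <- iris_tbl %>%
      ft_vector_assembler(
        input_cols = c("Petal_Width", "Petal_Length"),
        output_col = "features"
      ) %>%
      ml_kmeans(k = 3)

    This mirrors what the formula interface does for you under the hood: Spark ML estimators operate on a single assembled Vector column rather than on separate numeric columns.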