Tags: apache-spark, apache-spark-mllib, apache-spark-ml

Spark MLlib: requirement failed: Index 0 follows 0 and is not strictly increasing


I'm getting the following error when training a logistic regression model using my dataset:

Caused by: java.lang.IllegalArgumentException: requirement failed: Index 0 follows 0 and is not strictly increasing
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.ml.linalg.SparseVector.$anonfun$new$5(Vectors.scala:629)
    at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:246)
    at org.apache.spark.ml.linalg.SparseVector.<init>(Vectors.scala:628)
    at org.apache.spark.ml.linalg.VectorUDT.deserialize(VectorUDT.scala:64)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:168)
    ... 38 more

I'm not sure what this error indicates or where I should start debugging. Could someone familiar with Spark MLlib give me some guidance? Thanks in advance!


Solution

  • You're constructing a sparse vector, and the list of (index, value) tuples contains a duplicate index 0, e.g.:

    Vectors.sparse(2, Seq((0, 1d), (0, 1d)))
    

    Older Spark releases let this slip through silently, but recent versions validate that the indices are strictly increasing when the SparseVector is constructed, and throw this exception if they aren't.

    I had this exact same issue. It turns out to be a useful exception as it highlighted a bug where two of my model's features were using the same prefix in their values, hence the duplicate index.
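
    One way to guard against this is to merge duplicate indices before constructing the vector. Below is a minimal Python sketch (plain stdlib, not Spark itself; the `normalize_pairs` helper is a hypothetical name, not a Spark API): it sums values that share an index and sorts the result, so the indices satisfy SparseVector's strictly-increasing requirement. The normalized pairs could then be passed to `Vectors.sparse`.

    ```python
    from collections import defaultdict

    def normalize_pairs(pairs):
        """Sum values sharing the same index and sort by index, so the
        result has strictly increasing indices as SparseVector requires."""
        merged = defaultdict(float)
        for index, value in pairs:
            merged[index] += value
        return sorted(merged.items())

    # The failing input from the answer: index 0 appears twice.
    print(normalize_pairs([(0, 1.0), (0, 1.0)]))  # [(0, 2.0)]
    ```

    Whether summing duplicates is the right fix depends on why they appear; if, as in the case above, the duplicates come from two distinct features colliding on the same index, the real fix is to disambiguate the features so each gets its own index.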