I'm trying to use a random forest for multiclass classification with Spark 2.1.1.
After defining my pipeline as usual, it fails during the indexing stage.
I have a DataFrame with many string-type columns, and I have created a StringIndexer for each of them.
I build a Pipeline by chaining the StringIndexers with a VectorAssembler, and finally a RandomForestClassifier followed by a label converter.
I've checked all my columns with distinct().count() to make sure I do not have too many categories, and so on.
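For reference, the cardinality check was roughly the following sketch (the column name "x0" stands in for any of the string columns being indexed; for RandomForestClassifier the count should stay at or below maxBins):

    // Illustrative sketch of the per-column cardinality check.
    // "x0" is a placeholder for each string column in the pipeline.
    df.select("x0").distinct().count()
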
After some debugging, I found that indexing some of the columns raises the following errors. When calling:
    val indexer = udf { label: String =>
      if (labelToIndex.contains(label)) {
        labelToIndex(label)
      } else {
        throw new SparkException(s"Unseen label: $label.")
      }
    }
Error evaluating method: 'labelToIndex'
Error evaluating method: 'labels'
Then, inside the transformation, this error appears when the metadata is defined:
Error evaluating method: org$apache$spark$ml$feature$StringIndexerModel$$labelToIndex Method threw 'java.lang.NullPointerException' exception. Cannot evaluate org.apache.spark.sql.types.Metadata.toString()
This is happening because some of the columns I'm indexing contain null values.
I could reproduce the error with the following example:
    val df = spark.createDataFrame(
      Seq(("asd2s", "1e1e", 1.1, 0), ("asd2s", "1e1e", 0.1, 0),
          (null, "1e3e", 1.2, 0), ("bd34t", "1e1e", 5.1, 1),
          ("asd2s", "1e3e", 0.2, 0), ("bd34t", "1e2e", 4.3, 1))
    ).toDF("x0", "x1", "x2", "x3")

    val indexer = new StringIndexer()
      .setInputCol("x0")
      .setOutputCol("x0idx")

    indexer.fit(df).transform(df).show
    // java.lang.NullPointerException
The solution presented here can be used as a workaround; as of Spark 2.2.0, the issue is fixed upstream.
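On versions before 2.2.0, one sketch of a workaround is to clean the nulls out of the input columns before fitting the indexers, either by dropping those rows or by replacing null with a sentinel category (the placeholder string below is an arbitrary choice, not a Spark convention):

    import org.apache.spark.ml.feature.StringIndexer

    // Option 1: drop rows that are null in the column to be indexed.
    val cleaned = df.na.drop(Seq("x0"))

    // Option 2: keep the rows and map null to a sentinel category,
    // which then gets its own index like any other label.
    val filled = df.na.fill("__missing__", Seq("x0"))

    val indexer = new StringIndexer()
      .setInputCol("x0")
      .setOutputCol("x0idx")

    indexer.fit(filled).transform(filled).show()

With Option 2 the sentinel survives into the model, so downstream stages treat "missing" as just another category rather than losing those rows.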