Tags: scala, apache-spark, dataframe, apache-spark-sql, apache-spark-ml

Tokenizer in the Spark DataFrame API


Each row of a Spark DataFrame df contains a tab-separated string in a column rawFV. I already know that splitting on tabs yields an array of 3 strings for every row. This can be verified by:

df.map(row => row.getAs[String]("rawFV").split("\t").length != 3).filter(identity).count()

and making sure that the count is indeed 0.
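As a side note on how plain String.split behaves (a generic Scala sketch, not from the original post): it keeps interior empty strings but silently drops trailing ones unless a negative limit is passed, so the check above holds as long as the empty field is not the last one:

// Interior empty fields are kept, so the length check still sees 3:
"a\tb\tc".split("\t").length      // 3
"a\t\tc".split("\t").length       // 3
// Trailing empty fields are dropped by default...
"a\tb\t".split("\t").length       // 2
// ...unless an explicit negative limit is given:
"a\tb\t".split("\t", -1).length   // 3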

My question is: How to do this using the pipeline API?

Here's what I tried:

val tabTok = new RegexTokenizer().setInputCol("rawFV").setOutputCol("tk").setPattern("\t")
val pipeline = new Pipeline().setStages(Array(tabTok))
val transf = pipeline.fit(df)
val df2 = transf.transform(df)
df2.map(row => row.getAs[Seq[String]]("tk").length != 3).filter(identity).count()

which is NOT equal to 0.

The issue has to do with the presence of missing values. Take two rows whose rawFV values are (the second has an empty middle field):

"a\tb\tc"
"a\t\tc"

The pipeline code with RegexTokenizer returns 3 fields for the first row but only 2 for the second, because the empty token is dropped. The plain split code above, on the other hand, correctly returns 3 fields for both.
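To make the difference concrete, here is a self-contained sketch with toy data (the values and the sqlContext-based setup are my own, assuming Spark 1.x as in the question):

import org.apache.spark.ml.feature.RegexTokenizer

// Two rows: the second has an empty middle field.
val toy = sqlContext.createDataFrame(Seq(
  (1, "a\tb\tc"),
  (2, "a\t\tc")
)).toDF("id", "rawFV")

val tok = new RegexTokenizer().setInputCol("rawFV").setOutputCol("tk").setPattern("\t")

// RegexTokenizer is a Transformer, so it can be applied without a Pipeline.
tok.transform(toy)
  .map(row => row.getAs[Seq[String]]("tk").length)
  .collect()   // Array(3, 2): the empty token was filtered out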


Solution

  • This is expected behavior. By default, the minTokenLength parameter is set to 1 to keep empty strings out of the output. If you want empty strings returned, set it to 0:

    new RegexTokenizer()
      .setInputCol("rawFV")
      .setOutputCol("tk")
      .setPattern("\t")
      .setMinTokenLength(0)
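
    With minTokenLength set to 0, the empty token is kept and the original length check passes. A quick verification, continuing the toy example above (the split-function alternative at the end is a suggestion of mine, not part of the original answer):

    import org.apache.spark.sql.functions.split

    val fixed = new RegexTokenizer()
      .setInputCol("rawFV")
      .setOutputCol("tk")
      .setPattern("\t")
      .setMinTokenLength(0)

    // Every row now yields 3 tokens, so the count of bad rows is 0.
    fixed.transform(toy)
      .map(row => row.getAs[Seq[String]]("tk").length != 3)
      .filter(identity)
      .count()   // 0

    // Outside the ML pipeline, the SQL split function also keeps empty fields:
    toy.withColumn("tk", split(toy("rawFV"), "\t"))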