Each row of a Spark DataFrame df contains a tab-separated string in a column rawFV. I already know that splitting on the tab will yield an array of 3 strings for all the rows. This can be verified by:

df.map(row => row.getAs[String]("rawFV").split("\t").length != 3).filter(identity).count()

and making sure that the count is indeed 0.
My question is: How to do this using the pipeline API?
Here's what I tried:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.RegexTokenizer

val tabTok = new RegexTokenizer().setInputCol("rawFV").setOutputCol("tk").setPattern("\t")
val pipeline = new Pipeline().setStages(Array(tabTok))
val transf = pipeline.fit(df)
val df2 = transf.transform(df)
df2.map(row => row.getAs[Seq[String]]("tk").length != 3).filter(identity).count()
which is NOT equal to 0.
The issue has to do with the presence of missing values.
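For example, consider a hypothetical df built from two rows in which the second row has an empty middle field (the values below are made up for illustration; the actual data isn't shown):

// hypothetical sample data; assumes a SparkSession's implicits are in scope, e.g. import spark.implicits._
val df = Seq("1.0\t2.0\t3.0", "4.0\t\t6.0").toDF("rawFV")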
The pipeline code with RegexTokenizer would return 3 fields on the first line but only 2 on the second. On the other hand, the first code would correctly return 3 fields everywhere.
This is expected behavior. By default, the minTokenLength parameter is equal to 1 to avoid empty strings in the output. If you want to return empty strings, it should be set to 0:
val tabTok = new RegexTokenizer()
  .setInputCol("rawFV")
  .setOutputCol("tk")
  .setPattern("\t")
  .setMinTokenLength(0)  // keep empty tokens so missing values are preserved
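With that change, plugging the tokenizer back into the pipeline should make the original check return 0 (a minimal sketch, reusing the df, column names, and check from the question):

val pipeline = new Pipeline().setStages(Array(tabTok))
val df2 = pipeline.fit(df).transform(df)

// same check as in the question; every row should now yield exactly 3 tokens
df2.map(row => row.getAs[Seq[String]]("tk").length != 3).filter(identity).count()  // expected: 0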