In my Spark ML Pipeline (Spark 2.3.0) I use a RegexTokenizer like this:
import org.apache.spark.ml.feature.RegexTokenizer

val regexTokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("words")
  .setMinTokenLength(3)
It transforms a DataFrame into one with arrays of words, for example:
text      | words
----------|-------------
a the     | [the]
a of to   | []
big small | [big, small]
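
For reference, the table above can be reproduced like this (a quick sketch; it assumes an active SparkSession named spark, as in spark-shell):

import spark.implicits._

val input = Seq("a the", "a of to", "big small").toDF("text")
regexTokenizer.transform(input).show()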
How can I filter out rows whose words array is empty? Should I create a custom transformer and pass it to the pipeline?
You can use SQLTransformer, where __THIS__ in the statement stands for the input DataFrame and size is the built-in Spark SQL function returning an array's length:
import org.apache.spark.ml.feature.SQLTransformer

val emptyRemover = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE size(words) > 0"
)
which can be applied directly:
import spark.implicits._  // assumes an active SparkSession named spark

val df = Seq(
  ("a the", Seq("the")), ("a of to", Seq()),
  ("big small", Seq("big", "small"))
).toDF("text", "words")

emptyRemover.transform(df).show
+---------+------------+
|     text|       words|
+---------+------------+
|    a the|       [the]|
|big small|[big, small]|
+---------+------------+
or used in a Pipeline.
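
A minimal sketch of that wiring (my own example; it fits on the text column only, so the tokenizer can create the words column itself):

import org.apache.spark.ml.Pipeline

// regexTokenizer produces words, which emptyRemover then filters
val pipeline = new Pipeline()
  .setStages(Array(regexTokenizer, emptyRemover))

pipeline.fit(df.select("text")).transform(df.select("text")).show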
Nonetheless, I would think twice before using this in a Spark ML process. Tools normally used downstream, like CountVectorizer, can handle empty input just fine:
import org.apache.spark.ml.feature.CountVectorizer

val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")

vectorizer.fit(df).transform(df).show
+---------+------------+-------------------+
|     text|       words|           features|
+---------+------------+-------------------+
|    a the|       [the]|      (3,[2],[1.0])|
|  a of to|          []|          (3,[],[])|
|big small|[big, small]|(3,[0,1],[1.0,1.0])|
+---------+------------+-------------------+
and the absence of certain words can often provide useful information.
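
To answer the original question directly: a custom transformer works as well, it is just more code than the SQLTransformer one-liner. A minimal sketch (the class name and the hard-coded words column are mine, and without extra work it won't support ML persistence):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, size}
import org.apache.spark.sql.types.StructType

class EmptyArrayRemover(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("emptyArrayRemover"))

  // Drop rows whose words array is empty; the schema is unchanged
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.toDF.filter(size(col("words")) > 0)

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): EmptyArrayRemover = defaultCopy(extra)
}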