
Apache Spark ML Pipeline: filter empty rows in dataset


In my Spark ML Pipeline (Spark 2.3.0) I use RegexTokenizer like this:

val regexTokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setMinTokenLength(3)

It transforms the DataFrame into one with arrays of words, for example:

text      | words
-------------------------
a the     | [the]
a of to   | []
big small | [big,small]

How can I filter out rows with empty `[]` arrays? Should I create a custom transformer and pass it to the pipeline?


Solution

  • You can use SQLTransformer:

    import org.apache.spark.ml.feature.SQLTransformer
    
    val emptyRemover = new SQLTransformer().setStatement(
      "SELECT * FROM __THIS__ WHERE size(words) > 0"
    )
    

    which can be applied directly:

    val df = Seq(
      ("a the", Seq("the")), ("a of the", Seq()), 
      ("big small", Seq("big", "small"))
    ).toDF("text", "words")
    
    emptyRemover.transform(df).show
    
    +---------+------------+
    |     text|       words|
    +---------+------------+
    |    a the|       [the]|
    |big small|[big, small]|
    +---------+------------+
    

    or used in a Pipeline.
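    For instance, a Pipeline combining the question's tokenizer with the filter might look like this (a sketch; `rawDf` is assumed to be a DataFrame with a `text` column):

    ```scala
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{RegexTokenizer, SQLTransformer}

    val regexTokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setMinTokenLength(3)

    val emptyRemover = new SQLTransformer().setStatement(
      "SELECT * FROM __THIS__ WHERE size(words) > 0"
    )

    // Rows whose `words` array is empty are dropped between the two stages.
    val pipeline = new Pipeline().setStages(Array(regexTokenizer, emptyRemover))
    val model = pipeline.fit(rawDf)  // rawDf: hypothetical input DataFrame
    model.transform(rawDf).show
    ```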

    Nonetheless, I would think twice before using this in a Spark ML process. Tools normally used downstream, like CountVectorizer, handle empty input just fine:

    import org.apache.spark.ml.feature.CountVectorizer
    
    val vectorizer = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
    
    vectorizer.fit(df).transform(df).show
    
    +---------+------------+-------------------+                 
    |     text|       words|           features|
    +---------+------------+-------------------+
    |    a the|       [the]|      (3,[2],[1.0])|
    | a of the|          []|          (3,[],[])|
    |big small|[big, small]|(3,[0,1],[1.0,1.0])|
    +---------+------------+-------------------+
    

    and the absence of certain words can often provide useful information.
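
    Finally, if the filtering does not need to live inside the Pipeline at all, a plain DataFrame filter achieves the same result (a sketch; `df` is the tokenized DataFrame from above):

    ```scala
    import org.apache.spark.sql.functions.{col, size}

    // Keep only rows whose `words` array contains at least one token.
    val nonEmpty = df.filter(size(col("words")) > 0)
    ```

    This avoids an extra Pipeline stage, at the cost of the filter not being serialized with the fitted model.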