python, apache-spark, pyspark, apache-spark-sql, apache-spark-ml

Pyspark: Filter DF based on Array(String) length, or CountVectorizer count


I have URL data aggregated into a string array, of this form: [xyz.com, abc.com, efg.com].

I eventually use a CountVectorizer in PySpark to get it into a vector like (262144,[3,20,83721],[1.0,1.0,1.0]).

The vector says that, out of a vocabulary of size 262144, there are 3 URLs present in this row, at indices 3, 20, and 83721. All of this data is binary, hence the array of 1.0s.
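
To make that representation concrete, the tuple is just the printed form of a sparse vector, (size, indices, values). A minimal illustration, with the indices taken from the example above rather than real data:

from pyspark.ml.linalg import Vectors

# (262144,[3,20,83721],[1.0,1.0,1.0]) as a SparseVector
v = Vectors.sparse(262144, [3, 20, 83721], [1.0, 1.0, 1.0])
print(v.numNonzeros())  # 3 -> the number of URLs present in this row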

I would like to filter and only use rows that contain a certain number of entries. So if, say, I only wanted to use rows with at least 4 URLs, the row above should be dropped.

I am fine doing this filtering either on the string array or on the vector form returned by CountVectorizer.

My data has tens of millions of rows, and I am just not sure how to do this efficiently.

Here is code from the docs, edited to produce an example:

from pyspark.ml.feature import CountVectorizer

# Input data: each row is a bag of words with an ID.
df = spark.createDataFrame([
    (0, "a".split(" ")),
    (1, "a b c".split(" "))
], ["id", "words"])

# fit a CountVectorizerModel from the corpus.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3)

model = cv.fit(df)

result = model.transform(df)
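
To inspect what this produces (this part is not in the docs snippet, it is just for checking the output):

# Show the learned vocabulary and the resulting sparse feature vectors
print(model.vocabulary)
result.show(truncate=False)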

So here, say we wanted only rows whose words array has length 2 or greater. Again, I do not mind doing this on the vector produced by CountVectorizer or on the string array before it, as long as it is efficient given the size of my data.


Solution

  • You can just use DataFrame.filter with the size function:

    from pyspark.sql.functions import size
    
    df.filter(size('words') >= 2).show()
    
    +---+---------+
    | id|    words|
    +---+---------+
    |  1|[a, b, c]|
    +---+---------+
    

    I would do it before CountVectorizer to avoid having it do work that doesn't need to be done. Spark will pull filter operations earlier in the execution plan if it can determine that it is safe to do so, but being explicit is always better.
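
    Concretely, applying the filter before fitting keeps CountVectorizer from ever seeing the short rows; and if you did want to filter on the vector instead, a UDF over the vector's non-zero count would work. A sketch, assuming the binary vectors described in the question (the min_entries threshold is just illustrative):

    from pyspark.sql.functions import size, udf
    from pyspark.sql.types import BooleanType

    # Option 1: filter on the string array, then fit/transform only the kept rows.
    filtered = df.filter(size('words') >= 2)
    model = cv.fit(filtered)
    result = model.transform(filtered)

    # Option 2: filter on the vector itself; for binary data the number of
    # non-zero entries equals the number of URLs in the row.
    min_entries = 2
    has_enough = udf(lambda v: v.numNonzeros() >= min_entries, BooleanType())
    result_filtered = result.filter(has_enough('features'))

    Note that a plain Python UDF like Option 2 is generally slower than the built-in size function, which is another reason to prefer filtering before vectorizing.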