
Filter on the basis of multiple strings in a pyspark array column


df = sqlContext.createDataFrame(pd.DataFrame([('id1', ['a','b','c']),
                                              ('id2', ['a','b','d']),
                                              ('id3', ['c','e','f'])], 
                                              columns=['id', 'items']))

df.show()

from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import ArrayType, StringType, IntegerType

filter_array_udf = udf(lambda arr: [1 if (x =='a' and x=='b') else 0 for x in arr], "int")

df2 = df.withColumn("listed1", filter_array_udf(col("items")))
df2.show(20, 0)

I am trying to flag a row if its items array contains the string 'a' or 'b', but the udf above returns a null value. I am fairly new to udfs. What do I have to change in the given udf to get the desired result?
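For reference, the udf itself can be fixed with two changes (this is a sketch, assuming the intended logic is "flag the row when the array contains 'a' or 'b'"): the original lambda compares a single element with `and` (x == 'a' and x == 'b' can never be true), and it returns a list of per-element flags while declaring an "int" return type, which is why Spark yields null. Returning one int for the whole array resolves both problems:

```python
# Original (broken): [1 if (x == 'a' and x == 'b') else 0 for x in arr]
# - `and` on one element is always False; `or`/membership is what's wanted
# - returns a list, but the udf declares "int", so Spark produces null
# Fixed: return a single int for the whole row
flag_a_or_b = lambda arr: 1 if ('a' in arr or 'b' in arr) else 0

# Wrapped as before (same udf/withColumn calls as above):
# filter_array_udf = udf(flag_a_or_b, "int")
# df2 = df.withColumn("listed1", filter_array_udf(col("items")))
```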

import pyspark.sql.functions as F

df.filter(F.array_contains(F.col('items'), 'a')).show()

This works for a single string, but if I pass an array like ['a', 'b'] instead, it throws an error:

Unsupported literal type class java.util.ArrayList [a, b]

Solution

  • Use array_intersect to check whether any of the target elements are present in the array column: the intersection is non-empty exactly when the row should be flagged.

    from pyspark.sql import functions as f
    
    df.withColumn('temp', f.array_intersect(f.col('items'), f.array(f.lit('a'), f.lit('b')))) \
      .withColumn('listed1', f.expr('if(temp != array(), true, false)')) \
      .show(10, False)
    
    +---+---------+------+-------+
    |id |items    |temp  |listed1|
    +---+---------+------+-------+
    |id1|[a, b, c]|[a, b]|true   |
    |id2|[a, b, d]|[a, b]|true   |
    |id3|[c, e, f]|[]    |false  |
    +---+---------+------+-------+