python, dataframe, pyspark

Pyspark: filter dataframe based on list with many conditions


Suppose you have a pyspark dataframe df with columns A and B.
Now, you want to filter the dataframe with many conditions.

The conditions are contained in a list of dicts:

l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}, ...]

The filtering should be done as follows:

df.filter(
    ((df['A'] == l[0]['A']) & (df['B'] == l[0]['B']))
    &
    ((df['A'] == l[1]['A']) & (df['B'] == l[1]['B']))
    &
    ...
)

How can this be done when l contains many conditions, i.e. when writing the filter condition out by hand is not practical?

I thought about using separate filter steps, i.e.:

for d in l:
    df = df.filter((df['A'] == d['A']) & (df['B'] == d['B']))

Is there a shorter or more elegant way of doing this, e.g. something similar to a list comprehension?
Also, this approach does not work for ORs (|), since chaining filter calls only ANDs the conditions together.


Solution

  • You could use your list of dictionaries to build a SQL expression and pass it to your filter all at once.

    l = [{'A': 'val1', 'B': 5}, {'A': 'val4', 'B': 2}]
    df.show()
    
    #+----+---+
    #|   A|  B|
    #+----+---+
    #|val1|  5|
    #|val1|  1|
    #|val1|  3|
    #|val4|  2|
    #|val1|  4|
    #|val1|  1|
    #+----+---+
    
    # OR the per-dict clauses together; within each clause, A and B are ANDed
    df.filter(' or '.join(["A = '{}' and B = {}".format(d['A'], d['B']) for d in l])).show()
    
    #+----+---+
    #|   A|  B|
    #+----+---+
    #|val1|  5|
    #|val4|  2|
    #+----+---+
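
  • A programmatic alternative (a sketch, not part of the original answer): functools.reduce and pyspark.sql.functions.col can build the same predicate from Column expressions, which keeps the AND/OR combination explicit and avoids hand-quoting string values. It returns the same two rows as the SQL-expression version above.

    from functools import reduce
    from pyspark.sql import functions as F

    # one AND-ed condition per dict ...
    per_dict = [
        reduce(lambda a, b: a & b, [F.col(k) == v for k, v in d.items()])
        for d in l
    ]

    # ... then OR the per-dict conditions together
    df.filter(reduce(lambda a, b: a | b, per_dict)).show()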