python, apache-spark, pyspark, operators

PySpark using OR operator in filter


I have an array column that I am filtering to select data from cities in California.

This filter works: raw_df_2 = raw_df_1.filter(array_contains(col("country.state.city"), 'San Diego'))

However, when I expand to include other cities:

raw_df_2 = raw_df_1.filter(array_contains(col("country.state.city"), 'San Diego') || array_contains(col("country.state.city"), 'Sacramento') || array_contains(col("country.state.city"), 'Los Angeles'))

I get SyntaxError: invalid syntax

I have also tried

raw_df_2 = raw_df_1.filter(array_contains(col("country.state.city"), 'San Diego' || 'Sacramento' || 'Los Angeles'))

but this also returns SyntaxError: invalid syntax

What is the correct usage of the OR operator in Spark to filter data from Californian cities?


Solution

  • Python has no || operator, which is why both attempts raise SyntaxError: invalid syntax. In PySpark, logical OR between Column expressions is written with a single vertical bar (|), applied between complete array_contains conditions (use & for AND).

    raw_df_2 = raw_df_1.filter(
        array_contains(col("country.state.city"), 'San Diego')
        | array_contains(col("country.state.city"), 'Sacramento')
        | array_contains(col("country.state.city"), 'Los Angeles')
    )
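
    If the list of cities grows, the same | logic can be built up programmatically instead of being written out by hand. Below is a minimal, self-contained sketch of that approach; the tiny sample DataFrame, the california_cities list, and the spark session are illustrative assumptions, not part of the original question.

    from functools import reduce
    import operator

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.functions import array_contains, col

    spark = SparkSession.builder.getOrCreate()

    # Illustrative stand-in for raw_df_1: a nested country.state.city array column.
    raw_df_1 = spark.createDataFrame([
        Row(country=Row(state=Row(city=["San Diego", "Irvine"]))),
        Row(country=Row(state=Row(city=["Portland"]))),
    ])

    california_cities = ["San Diego", "Sacramento", "Los Angeles"]

    # One array_contains condition per city, OR-ed together with |.
    conditions = [array_contains(col("country.state.city"), c) for c in california_cities]
    raw_df_2 = raw_df_1.filter(reduce(operator.or_, conditions))

    raw_df_2.show(truncate=False)

    This keeps the filter in one place, so adding another city only means extending the list.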