scala, apache-spark

How to write a condition based on multiple values for a DataFrame in Spark


I'm working on a Spark application (using Scala) and I have a List that contains multiple values. I'd like to use this list to write a where clause for my DataFrame and select only a subset of rows. For example, my List contains 'value1', 'value2', and 'value3', and I would like to write something like this:

mydf.where($"col1" === "value1" || $"col1" === "value2" || $"col1" === "value3)

How can I do this programmatically, since the list contains many values?


Solution

  • You can map the list of values to a list of filters (each of type Column), then reduce that list into a single filter by combining every two filters with the || operator:

    import spark.implicits._  // enables the $"colName" column syntax

    // Build one predicate per value, then OR them into a single filter
    val possibleValues = Seq("value1", "value2", "value3")
    val result = mydf.where(possibleValues.map($"col1" === _).reduce(_ || _))
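For context, here is a minimal self-contained sketch of the same approach. The SparkSession setup, the object name, and the sample data are illustrative assumptions, not part of the original question:

    import org.apache.spark.sql.SparkSession

    object MultiValueFilter {
      def main(args: Array[String]): Unit = {
        // Local SparkSession purely for demonstration (assumption)
        val spark = SparkSession.builder()
          .appName("multi-value-filter")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical sample data standing in for mydf
        val mydf = Seq("value1", "value2", "value4", "value3").toDF("col1")

        val possibleValues = Seq("value1", "value2", "value3")

        // Map each value to a Column predicate, then OR them together
        val filter = possibleValues.map($"col1" === _).reduce(_ || _)
        mydf.where(filter).show()  // shows only the rows matching value1/value2/value3

        spark.stop()
      }
    }

As a side note, Spark's built-in Column.isin expresses the same condition more concisely: mydf.where($"col1".isin(possibleValues: _*)).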