Tags: scala, apache-spark

Using 'when' instead of 'expr' when filtering for values in multiple columns in Scala Spark


I'm having some trouble refactoring a Spark DataFrame transformation to drop `expr` in favor of DataFrame column expressions and `when` conditionals.

My code is this:

 outDF = outDF.withColumn("MAIN_TYPE", expr(
   "case when 'TYPE_A' in (GROUP_A, GROUP_B, GROUP_C, GROUP_D) then 'TYPE_A' " +
   "when 'TYPE_B' in (GROUP_A, GROUP_B, GROUP_C, GROUP_D) then 'TYPE_B' " +
   "when 'TYPE_C' in (GROUP_A, GROUP_B, GROUP_C, GROUP_D) then 'TYPE_C' " +
   "when 'TYPE_D' in (GROUP_A, GROUP_B, GROUP_C, GROUP_D) then 'TYPE_D' else '0' end")
   .cast(StringType))

The only solution I could think of so far is a series of individual `.when().otherwise()` chains, but that would require m×n lines, where m is the number of types and n the number of groups I need.

Is there any better way to do this kind of operation?

Thank you very much for your time!


Solution

  • So, this is how I worked this out, in case anyone is interested:

    I used a helper column for the groups, which I later dropped.

    This is how it worked:

        outDF = outDF.withColumn("Helper_Column", concat(col("Group_A"), col("Group_B"),
          col("Group_C"), col("Group_D")))
        outDF = outDF.withColumn("MAIN_TYPE",
          when(col("Helper_Column").like("%Type_A%"), "Type_A")
            .when(col("Helper_Column").like("%Type_B%"), "Type_B")
            .when(col("Helper_Column").like("%Type_C%"), "Type_C")
            .when(col("Helper_Column").like("%Type_D%"), "Type_D")
            .otherwise(lit("0")))
        outDF = outDF.drop("Helper_Column")
    
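    If the list of types grows, the chained `when` branches don't have to be written out by hand: they can be generated by folding over a list of labels, so the mapping stays one line per type. This is a minimal sketch, not the answer's exact code; it assumes `outDF` is a `var` that already has the `Helper_Column` from the step above, and the `types` list is a hypothetical example:

    ```scala
    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, lit, when}

    // Hypothetical list of type labels searched for in the helper column.
    val types = Seq("Type_A", "Type_B", "Type_C", "Type_D")

    // Fold from the default value outward: each step wraps the accumulator
    // in one more when(...).otherwise(...) branch, producing the same
    // nested conditional as the hand-written chain.
    val mainType: Column = types.foldRight(lit("0"): Column) { (t, acc) =>
      when(col("Helper_Column").like(s"%$t%"), t).otherwise(acc)
    }

    outDF = outDF.withColumn("MAIN_TYPE", mainType)
    ```

    Because `foldRight` starts from the `otherwise` default, earlier entries in `types` take precedence, matching the order of the original chain.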

    Hope this helps someone.