How to access a DataFrame column in PySpark and do a string comparison?


I have a Python function that returns True/False depending on the value of a DataFrame column.

def check_name(df):
    if df.name == "ABC":
        return df.Value < 0.80

    return df.Value == 0

And I pass this function into my query as myFunction:

def myQuery(myFunction):
    df.filter(...).groupBy(...).withColumn('Result', when(myFunction(df), 0).otherwise(1))

But it fails with this error:

Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I think the problem is the comparison df.name == "ABC".

I have tried changing to F.col('name') == "ABC", but I get the same error.

Can you please tell me how to fix my issue?


Solution

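  • A Python if/else cannot branch on a Column: the comparison df.name == "ABC" builds a column expression that Spark evaluates per row, not a single True/False that Python can test. Express the branch as a column expression with when(...).otherwise(...) instead.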

    from pyspark.sql import functions as F

    def check_name(df):
        return F.when(df.name == "ABC", df.Value < 0.80).otherwise(df.Value == 0)
    

    If myFunction must return a Boolean column and you are inverting its value (true = 0, false = 1), you can simplify myQuery by negating the column and casting it to int:

    def myQuery(myFunction):
        return (df.filter(...)
                .groupBy(...)
                # Negate the Boolean column first, then cast: True -> 0, False -> 1
                .withColumn('Result', (~myFunction(df)).cast('int')))