Search code examples
apache-sparkpysparkapache-spark-sqlconditional-statementstypeerror

'Column' object is not callable" - ge le gt lt methods


I have a dataframe and I'm trying to filter based on end_date if it's >= or < a certain date.

However, I'm getting a "not callable" error.

line 148, in <module>
    df_s1 = df_x.filter(df_x[\"end_date\"].ge(lit(\"2022-08-17\")))
TypeError: 'Column' object is not callable"

Here is my code:

df_x = df_x.join(df_di_meet, trim(df_x.application_id) == trim(df_di_meet.application_id), "left")\
             .select (df_x["*"], df_di_meet["end_date"])
             
# ... Cast end_date to timestamp ...end_date format looks like 2013-12-20 23:59:00.0000000
df_x = df_x.withColumn("end_date",(col("end_date").cast("timestamp")))

# ... Here df_s1 >= 2022-08-17
df_s1 = df_x.filter(df_x["end_date"].ge(lit("2022-08-17")))

#... Here df_s2 < 2022-08-17
df_s2 = df_x.filter(df_x["end_date"].lt(lit("2022-08-17")))

What I'm trying to do is check additional logic as well like the code below, but since it's not working with a when clause I decided to break down the dataframes and check each one separately. Is there an easier way, or how could I get the below code to work?

df_x = df_x.withColumn("REV_STAT_TYP_DES", when((df_x.review_statmnt_type_desc == lit("")) & (df_x("end_date").ge(lit("2022-08-17"))), "Not Released")
                                           when((df_x.review_statmnt_type_desc == lit("")) & ((df_x("end_date").lt(lit("2022-08-17"))) | (df_x.end_date == lit(""))), "Not Available")
                                           .otherwise(None))

Solution

  • There are attempts to make difficult code look cleaner. According to those recommendations, conditional statements may be better understood and maintained if they were separated into different variables. Look at how I've added isnull to some of the variables - it would have been a lot more difficult if they were not refactored into separate variables.

    from pyspark.sql import functions as F
    
    no_review = (F.col("review_statmnt_type_desc") == "") | F.isnull("review_statmnt_type_desc")
    no_end_date = (F.col("end_date") == "") | F.isnull("end_date")
    not_released = no_review & (F.col("end_date") >= F.lit("2022-08-17"))
    not_available = no_review & ((F.col("end_date") < F.lit("2022-08-17")) | no_end_date)
    

    Also, you don't need the otherwise clause if it returns null (its the default behaviour).

    df_x = df_x.withColumn(
        "REV_STAT_TYP_DES",
        F.when(not_released, "Not Released")
         .when(not_available, "Not Available")
    )