Tags: python, amazon-web-services, pyspark, aws-glue

Error while doing filter in AWS Glue pipeline


I am running a query in AWS Glue and loading the output into 4 different files using the filter option, as given below:

df = (glueContext.read.format("jdbc")
      .option("driver", jdbc_driver_name)
      .option("url", db_url)
      .option("query", query)
      .option("user", db_username)
      .option("password", db_password)
      .load())
filtered_df0 = df.filter("ORDR_DOC_TYPE='0005'")
filtered_df1 = df.filter("ORDR_DOC_TYPE='0001'")
filtered_df2 = df.filter("ORDR_DOC_TYPE='0003'")
filtered_df3 = df.filter("(ORDR_STTS_CD)='9000' AND (LN_ITEM_REL_QTY)>'0.00' AND (col1)>=(col2)")

I am getting the data into 4 different files using the above 4 filters, but the 4th filter condition:

(col1) >= (col2)

is causing some issue. There is no failure as such, but I am getting a count mismatch. I then tried to modify that filter as below:

filtered_df3 = Filter.apply(frame = df, f = lambda x: x["col1"]>=x["col2"])

When I run this, the Glue job fails with the error below:

TypeError: DataFrame.filter() takes 2 positional arguments but 6 were given

I have tried my best to fix it, but no luck. Can someone please help me fix it?


Solution

  • Try using the PySpark col function in your filter statement, and note that column names must be passed to col as strings:

    from pyspark.sql.functions import col
    
    filtered_df3 = df.filter((col("ORDR_STTS_CD") == "9000") &
                             (col("LN_ITEM_REL_QTY") > 0.00) &
                             (col("col1") >= col("col2")))
    

    Another thought is to make sure that the columns you are comparing with the greater-than operator (>) are actually numeric. If they come back from JDBC as strings, the comparison is lexicographic rather than numeric (for example, '9' > '10' is true for strings), which would explain the count mismatch; see the sketch after this paragraph.
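
    A minimal sketch of casting before filtering, assuming LN_ITEM_REL_QTY, col1, and col2 arrive from the JDBC source as strings (the double type here is an assumption; match it to your actual schema):

    from pyspark.sql.functions import col

    # Cast the string columns to numeric types before comparing them;
    # the column names are taken from the question.
    numeric_df = (df.withColumn("LN_ITEM_REL_QTY", col("LN_ITEM_REL_QTY").cast("double"))
                    .withColumn("col1", col("col1").cast("double"))
                    .withColumn("col2", col("col2").cast("double")))

    filtered_df3 = numeric_df.filter((col("ORDR_STTS_CD") == "9000") &
                                     (col("LN_ITEM_REL_QTY") > 0.00) &
                                     (col("col1") >= col("col2")))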

    Filter.apply is not an option for a Spark DataFrame: it is an AWS Glue transform that operates on DynamicFrames, and when it is handed a DataFrame it ends up calling DataFrame.filter() with extra arguments, which is exactly the TypeError above. To use it, convert the DataFrame to a DynamicFrame first, as in the sketch below.
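
    A minimal sketch of the DynamicFrame route, assuming glueContext is the GlueContext already created in the job ("dyf" is just a hypothetical transformation name):

    from awsglue.dynamicframe import DynamicFrame
    from awsglue.transforms import Filter

    # Wrap the Spark DataFrame in a Glue DynamicFrame.
    dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

    # Filter.apply takes a DynamicFrame and a predicate applied to each record.
    filtered_dyf = Filter.apply(frame=dyf,
                                f=lambda x: x["col1"] >= x["col2"])

    The same string-versus-numeric caveat applies here: if col1 and col2 are strings, the lambda still compares them lexicographically.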

    More details on the PySpark filter: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html