Tags: python, amazon-web-services, pyspark, aws-glue

Error while doing filter in AWS Glue pipeline


I am running a query in AWS Glue and loading the output into 4 different files using the filter option, as given below:

df = (glueContext.read.format("jdbc")
      .option("driver", jdbc_driver_name)
      .option("url", db_url)
      .option("query", query)
      .option("user", db_username)
      .option("password", db_password)
      .load())
filtered_df0 = df.filter("ORDR_DOC_TYPE='0005'")
filtered_df1 = df.filter("ORDR_DOC_TYPE='0001'")
filtered_df2 = df.filter("ORDR_DOC_TYPE='0003'")
filtered_df3 = df.filter("(ORDR_STTS_CD)='9000' AND (LN_ITEM_REL_QTY)>'0.00' AND (col1)>=(col2)")

I am getting the data into 4 different files using the above 4 filters, but the 4th filter condition:

(col1) >= (col2)

is causing some issue. There is no failure as such, but I am getting a count mismatch. I then tried to modify that filter as below:

filtered_df3 = Filter.apply(frame = df, f = lambda x: x["col1"]>=x["col2"])

When I run this, the Glue job fails with the error below:

TypeError: DataFrame.filter() takes 2 positional arguments but 6 were given

I have tried my best to fix it, but no luck. Can someone please help me fix it?


Solution

  • Try using the PySpark col function in your filter statement, and note that column names must be passed to col as strings:

    from pyspark.sql.functions import col
    
    filtered_df3 = df.filter((col("ORDR_STTS_CD") == "9000") &
                             (col("LN_ITEM_REL_QTY") > 0.00) &
                             (col("col1") >= col("col2")))
    

    Another thought is to make sure that the columns you are comparing with the greater-than operator (>) are actually numeric. If they come back from JDBC as strings, the comparison is lexicographic rather than numeric (for example, '9' > '10' is true for strings), which would explain the count mismatch; see the sketch after this paragraph.
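
    A minimal sketch of casting before filtering, assuming LN_ITEM_REL_QTY, col1, and col2 arrive from the JDBC source as strings (the double type here is an assumption; match it to your actual schema):

    from pyspark.sql.functions import col

    # Cast the string columns to numeric types before comparing them;
    # the column names are taken from the question.
    numeric_df = (df.withColumn("LN_ITEM_REL_QTY", col("LN_ITEM_REL_QTY").cast("double"))
                    .withColumn("col1", col("col1").cast("double"))
                    .withColumn("col2", col("col2").cast("double")))

    filtered_df3 = numeric_df.filter((col("ORDR_STTS_CD") == "9000") &
                                     (col("LN_ITEM_REL_QTY") > 0.00) &
                                     (col("col1") >= col("col2")))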

    Filter.apply is not an option for a Spark DataFrame: it is an AWS Glue transform that operates on DynamicFrames, and when it is handed a DataFrame it ends up calling DataFrame.filter() with extra arguments, which is exactly the TypeError above. To use it, convert the DataFrame to a DynamicFrame first, as in the sketch below.
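
    A minimal sketch of the DynamicFrame route, assuming glueContext is the GlueContext already created in the job ("dyf" is just a hypothetical transformation name):

    from awsglue.dynamicframe import DynamicFrame
    from awsglue.transforms import Filter

    # Wrap the Spark DataFrame in a Glue DynamicFrame.
    dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

    # Filter.apply takes a DynamicFrame and a predicate applied to each record.
    filtered_dyf = Filter.apply(frame=dyf,
                                f=lambda x: x["col1"] >= x["col2"])

    The same string-versus-numeric caveat applies here: if col1 and col2 are strings, the lambda still compares them lexicographically.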

    More details on the PySpark filter: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html