I am running a query in AWS Glue and loading the output into 4 different files using the filter option, as shown below:
df = glueContext.read.format("jdbc") \
    .option("driver", jdbc_driver_name) \
    .option("url", db_url) \
    .option("query", query) \
    .option("user", db_username) \
    .option("password", db_password) \
    .load()
filtered_df0 = df.filter("ORDR_DOC_TYPE='0005'")
filtered_df1 = df.filter("ORDR_DOC_TYPE='0001'")
filtered_df2 = df.filter("ORDR_DOC_TYPE='0003'")
filtered_df3 = df.filter("(ORDR_STTS_CD)='9000' AND (LN_ITEM_REL_QTY)>'0.00' AND (col1)>=(col2)")
I am getting the data into 4 different files using the above 4 filters, but the 4th filter condition:
(col1) >= (col2)
is causing some issue. There is no failure as such, but I am getting a count mismatch. Then I tried to modify that filter as below:
filtered_df3 = Filter.apply(frame = df, f = lambda x: x["col1"]>=x["col2"])
When I run this, GLUE job is failing due to below error:
TypeError: DataFrame.filter() takes 2 positional arguments but 6 were given
I tried my best to fix it, but no luck. Can someone please help me fix it?
Try using the PySpark col function in your filter statement.
from pyspark.sql.functions import col
filtered_df3 = df.filter(
    (col("ORDR_STTS_CD") == "9000") &
    (col("LN_ITEM_REL_QTY") > 0.00) &
    (col("col1") >= col("col2"))
)
Another thought: make sure the columns you are comparing with the greater-than operator (>) are actually numeric. If they are strings, the comparison is lexicographic and can return surprising results.
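Both failure modes can be seen with plain SQL (the table and column values below are made up for illustration): string columns compare lexicographically, and a NULL on either side of >= makes the predicate unknown, so the row is silently dropped — which shows up as a count mismatch rather than an error.

```python
import sqlite3

# In-memory table with TEXT columns, mimicking col1/col2 arriving as strings over JDBC
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("2", "1"), ("10", "9"), (None, "1")])

# Lexicographic compare: '10' >= '9' is false because '1' < '9', so only 1 row matches
print(con.execute("SELECT COUNT(*) FROM t WHERE col1 >= col2").fetchone()[0])  # 1

# Casting to numeric compares values (2 rows match), but the NULL row is still dropped
print(con.execute(
    "SELECT COUNT(*) FROM t "
    "WHERE CAST(col1 AS REAL) >= CAST(col2 AS REAL)").fetchone()[0])  # 2
```

The same applies in Spark SQL: cast both columns to a numeric type before comparing, and decide explicitly how NULLs should be handled.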
Filter.apply is not an option for the PySpark filter command — it is an AWS Glue transform that operates on DynamicFrames, not on Spark DataFrames.
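The TypeError can be reproduced without Glue at all. The toy class below is a stand-in for PySpark's DataFrame, whose filter method accepts exactly one condition; the extra arguments Glue passes internally are paraphrased here, not Glue's exact call:

```python
# Minimal stand-in for pyspark's DataFrame: filter() takes a single condition.
class DataFrame:
    def filter(self, condition):
        return condition

# Glue's Filter transform calls frame.filter(...) with several extra arguments
# (transformation context, error thresholds, ...). A DynamicFrame accepts them,
# but a Spark DataFrame's filter() does not:
df = DataFrame()
try:
    df.filter(lambda x: True, "", "", 0, 0)
except TypeError as e:
    print(e)  # "... takes 2 positional arguments but 6 were given"
```

So either keep the DataFrame and express the condition with col(), or convert to a DynamicFrame (DynamicFrame.fromDF) before using Filter.apply.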
More details on pyspark.sql.DataFrame.filter: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.filter.html