I have a dataframe like this after some operations;
df_new_1 = df_old.filter(df_old["col1"] >= df_old["col2"])
df_new_2 = df_old.filter(df_old["col1"] < df_old["col2"])
print(df_new_1.count(), df_new_2.count())
>> 10, 15
I can find the number of rows individually like above by calling count()
. But how can I do this using pyspark sql
row
operation. i.e aggregating by row. I want to see the result like this;
Row(check1=10, check2=15)
Since you tagged pyspark-sql, you can do the following:
df_old.createOrReplaceTempView("df_table")
spark.sql("""
SELECT sum(int(col1 >= col2)) as check1
, sum(int(col1 < col2)) as check2
FROM df_table
""").collect()
Or use the API functions:
from pyspark.sql.functions import expr
df_old.agg(
expr("sum(int(col1 >= col2)) as check1"),
expr("sum(int(col1 < col2)) as check2")
).collect()