
pyspark sql: how to count rows with multiple conditions


I have a dataframe like this after some operations:

df_new_1 = df_old.filter(df_old["col1"] >= df_old["col2"])
df_new_2 = df_old.filter(df_old["col1"] < df_old["col2"])

print(df_new_1.count(), df_new_2.count())
>> 10 15

I can find the number of rows individually, as above, by calling count() on each dataframe. But how can I do this with a single PySpark SQL aggregation, i.e. in one pass over the rows? I want the result to look like this:

Row(check1=10, check2=15)

Solution

  • Since you tagged pyspark-sql, you can do the following:

    df_old.createOrReplaceTempView("df_table")
    
    spark.sql("""
    
        SELECT sum(int(col1 >= col2)) as check1
        ,      sum(int(col1 < col2)) as check2
        FROM df_table
    
    """).collect()
    

    Or use the API functions:

    from pyspark.sql.functions import expr
    
    df_old.agg(
        expr("sum(int(col1 >= col2)) as check1"), 
        expr("sum(int(col1 < col2)) as check2")
    ).collect()
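
    If you prefer to avoid SQL strings entirely, the same aggregation can be written with column expressions. Below is a minimal, self-contained sketch; the sample rows are made up for illustration, and only the col1/col2 names come from your question:

    # Sketch: same sum-of-boolean-indicators aggregation using column objects.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data standing in for df_old (illustrative values only).
    df_old = spark.createDataFrame(
        [(3, 1), (2, 2), (1, 4)],
        ["col1", "col2"],
    )

    result = df_old.agg(
        # Cast each boolean comparison to int, then sum to get a count.
        F.sum((F.col("col1") >= F.col("col2")).cast("int")).alias("check1"),
        F.sum((F.col("col1") < F.col("col2")).cast("int")).alias("check2"),
    ).collect()

    print(result)  # e.g. [Row(check1=2, check2=1)] for the toy rows above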