sql, apache-spark, pyspark, apache-spark-sql

How to aggregate in Spark SQL


I have a dataset like

user_id  |  value
1111        NULL
1111        active
2222        active  

I want to group by user_id and get the first available value for each user, so I run

select 
    user_id,
    COLLECT_LIST(value)[0]
from table
group by user_id;

which doesn't work, as for 1111 the value NULL is shown.

I try

select 
    user_id,
    COALESCE(value)
from table; 

But that returns an array, not a string. Is there a function in Spark SQL which can help me with this?


Solution

SELECT 
    user_id,
    first(value, true) AS first_value
FROM 
    table
GROUP BY 
    user_id;

The first(value, true) aggregate takes two arguments: the column to aggregate (value) and a boolean ignoreNulls flag (true). With the flag set, it returns the first available non-null entry for each user; if all entries for a user are null, it returns null. Note that without an explicit ordering, "first" is not deterministic in Spark, since row order within a group depends on how the data is partitioned.
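For readers using the DataFrame API, the same aggregation is exposed as pyspark.sql.functions.first(col, ignorenulls=True). The null-skipping semantics can be illustrated with a small plain-Python sketch (no Spark required; the data mirrors the table above, and first_ignore_nulls is a hypothetical helper, not a Spark function):

```python
# Plain-Python illustration of first(value, true) semantics:
# each group keeps the first non-null value seen, or None if all are null.
rows = [
    (1111, None),      # NULL in the original table
    (1111, "active"),
    (2222, "active"),
]

def first_ignore_nulls(rows):
    result = {}
    for user_id, value in rows:
        result.setdefault(user_id, None)
        # Only record a value if we haven't seen a non-null one yet.
        if result[user_id] is None and value is not None:
            result[user_id] = value
    return result

print(first_ignore_nulls(rows))  # {1111: 'active', 2222: 'active'}
```

Unlike this sketch, which walks rows in list order, Spark processes partitions in an unspecified order, which is why an ORDER BY inside a window (or a deterministic sort before aggregating) is needed when "first" must mean first by some timestamp or sequence column.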