Search code examples
sqlpysparkapache-zeppelin

How to select IDs that have at least two specific instaces in a given column


I'm working with a medical claim table in pyspark and I want to return only userid's that have at least 2 claim_ids. My table looks something like this:

claim_id |  userid |  diagnosis_type |  claim_type
__________________________________________________
1            1            C100            M
2            1            C100a           M
3            2            D50             F
5            3            G200            M
6            3            C100            M
7            4            C100a           M
8            4            D50             F
9            4            A25             F

From this example, I would want to return userid's 1, 3, and 4 only. Currently I'm building a temp table to count all of the distinct instances of the claim_ids

create table temp.claim_count as
select distinct userid, count(distinct claim_id) as claims
from medical_claims
group by userid

and then pulling from this table when the number of claim_id >1

select distinct userid
from medical_claims
where userid (
    select distinct userid
    from temp.claim_count
    where claims>1)

Is there a better / more efficient way of doing this?


Solution

  • If you want only the ids, then use group by:

    select userid, count(*) as claims
    from medical_claims
    group by userid
    having count(*) > 1;
    

    If you want the original rows, then use window functions:

    select mc.*
    from (select mc.*, count(*) over (partition by userid) as num_claims
          from medical_claims mc
         ) mc
    where num_claims > 1;