Search code examples
sqlapache-spark-sqlsubquerysql-in

Spark: Equivalent to not in


I have a where clause in Spark SQL that for some reason doesn't return any records. I think it doesn't work and so I would like to ask what would be equivalent to it?

SELECT
  c.client_id,
  current_date() as insert_date
FROM
  CLIENT_SUB c
WHERE
  (c.client_id, insert_date) not in (SELECT client_id, insert_date from CLIENT_SUBSCRIBER_CONTRACT)

I heard I could do it with a join


Solution

  • I would recommend not exists: it is null-safe, while not it isn't - and it usually scales better, too.

    I am also suspiscious about the reference to insert_date: do you really mean that, or you actually want current_date()?

    select cs.client_id, current_date() as insert_date
    from client_sub cs
    where not exists (
        select 1 
        from client_subscriber_contract csc
        where 
            csc.client_id = c.client_id 
            and csc.insert_date = cs.insert_date
            -- or, maybe: csc.insert_date = current_date()
    )
    

    For performance, consider an index on client_subscriber_contract(client_id, insert_date).