Search code examples
sqlapache-spark-sqlwhere-clausedatabricks

Invalid subquery: Scalar subquery must return only one column


When i run this query in data bricks i get the following error

[INVALID_SUBQUERY_EXPRESSION.SCALAR_SUBQUERY_RETURN_MORE_THAN_ONE_OUTPUT_COLUMN] Invalid subquery: Scalar subquery must return only one column, but got 3. SQLSTATE: 42823

`select 
    all.a,
    all.b,
    all.c
from    com a11
    join    fam_grp a12
      on    ((a11.mkt_id || a11.brd_id) = (a12.mkt_id || a12.brd_id) and 
    a11.dataset = a12.dataset)
where   ((a11.cust_id,
    a11.mkt_id,
    a11.o_cust_id)
 in (((select   pc21.cust_id,
        pc21.mkt_id_1,
        pc21.o_cust_id
    from    ZZMD00  pc21
    group by    pc21.cust_id,
        pc21.mkt_id_1,
        pc21.o_cust_id)
 intersect (select  pc21.cust_id,
        pc21.mkt_id_1,
        pc21.o_cust_id
    from    ZZMD01  pc21
    group by    pc21.cust_id,
        pc21.mkt_id_1,
        pc21.o_cust_id)))
 and a11.dataset in ('C'))
group by    
    all.a,
    all.b,
    all.c`

I believe i get the issue after the where clause using a sub query, i guess that is causing the error.


Solution

  • You can't use row-comparators in your DBMS, so IN with multiple columns doesn't work.

    You need to use EXISTS with either INTERSECT or an inner WHERE filter.

    Also, given that you are using an EXISTS or IN, using GROUP BY in the subquery doesn't make much sense anyway.

    select 
        all.a,
        all.b,
        all.c
    from com a11
    join fam_grp a12
      on a11.mkt_id = a12.mkt_id
     and a11.brd_id = a12.brd_id
     and a11.dataset = a12.dataset
    where exists (
        select
            a11.cust_id,
            a11.mkt_id,
            a11.o_cust_id
    
        intersect
    
        select
            pc21.cust_id,
            pc21.mkt_id_1,
            pc21.o_cust_id
        from ZZMD00 pc21
    
        intersect
        
        select
            pc21.cust_id,
            pc21.mkt_id_1,
            pc21.o_cust_id
        from ZZMD01 pc21
      )
      and a11.dataset in ('C')
    group by    
        all.a,
        all.b,
        all.c;
    

    Alternatively

    select 
        all.a,
        all.b,
        all.c
    from com a11
    join fam_grp a12
      on a11.mkt_id = a12.mkt_id
     and a11.brd_id = a12.brd_id
     and a11.dataset = a12.dataset
    where exists (select 1
        from ZZMD00 pc21
        where a11.cust_id = pc21.cust_id
          and a11.mkt_id = pc21.mkt_id
          and a11.o_cust_id = pc21.o_cust_id
    
        intersect
        
        select
            pc21.cust_id,
            pc21.mkt_id_1,
            pc21.o_cust_id
        from ZZMD01 pc21
      )
      and a11.dataset in ('C')
    group by    
        all.a,
        all.b,
        all.c;
    

    You can also do this as separate where predicates.

    select 
        all.a,
        all.b,
        all.c
    from com a11
    join fam_grp a12
      on a11.mkt_id = a12.mkt_id
     and a11.brd_id = a12.brd_id
     and a11.dataset = a12.dataset
    where exists (select 1
        from ZZMD00 pc21
        where a11.cust_id = pc21.cust_id
          and a11.mkt_id = pc21.mkt_id
          and a11.o_cust_id = pc21.o_cust_id
    )
      and exists (select 1
        from ZZMD01 pc21
        where a11.cust_id = pc21.cust_id
          and a11.mkt_id = pc21.mkt_id
          and a11.o_cust_id = pc21.o_cust_id
    )
      and a11.dataset in ('C')
    group by    
        all.a,
        all.b,
        all.c;
    

    I'm not sure why you are concatting mkt_id and brd_id, but it's unlikely to be correct or performant.