Search code examples
sqljoinleft-joinimpala

Special LEFT JOIN


I have the following SQL (Impala) pseudo query, since it wont compile this way. The interesting part is the last, where I want to do exactly what you can read.

I want to do a LEFT JOIN but in case there is no matching ProductId I want to use a specific ProductId (which is NULL and suppose there is only one but guaranteeing it by using LIMIT 1) and perform a JOIN-like connection so the above conditions in the CASE-WHEN would work properly.

So basically the question is that if there is a way to transform this syntactically incorrect query to a single correct one?

I was trying different things using e.g. ISNULL() and WITH, but since the subquery which you can see in the ELSE part has to use 2 tables to work properly it can not be compiled anyway it would work in my opinion.

SELECT 
    cd.CycleDataId AS CycleDataId,
    CASE   
        WHEN cd.CycleTime >= op.IdealValue AND cd.CycleTime <= op.MaxValue THEN NVL(dcl.ProductionLossTypeId, -1)
        WHEN cd.CycleTime >= op.IdealValue AND cd.CycleTime >= op.MaxValue THEN dcl.ProductionLossTypeId
    END AS Verdikt,
    CASE   
        WHEN cd.CycleTime >= op.IdealValue AND cd.CycleTime <= op.MaxValue THEN NVL(dcl.Time, cd.CycleTime - op.IdealValue)
        WHEN cd.CycleTime >= op.IdealValue AND cd.CycleTime >= op.MaxValue THEN dcl.Time  
    END AS Time
FROM CycleData cd
LEFT JOIN DistributedCycleLosses dcl ON dcl.CycleDataId = cd.CycleDataId
CASE   
WHEN IF EXISTS(SELECT * FROM Operation_parameter WHERE ProductId = cd.ProductId AND cd.Timestamp_ BETWEEN ValidFrom AND ValidTo) THEN LEFT JOIN Operation_parameter op ON op.ProductId = cd.ProductId AND cd.Timestamp_ BETWEEN op.ValidFrom AND op.ValidTo
ELSE (SELECT * FROM Operation_parameter WHERE ProductId IS NULL AND cd.Timestamp_ BETWEEN ValidFrom AND ValidTo LIMIT 1) AS op
END;

Solution

  • This is basically using a default value. I think this does what you want:

    SELECT cd.CycleDataId AS CycleDataId,
            (CASE WHEN cd.CycleTime >= COALESCE(op.IdealValue, opnull.IdealValue) AND 
                       cd.CycleTime <= COALESCE(op.MaxValue, opnull.MaxValue) 
                  THEN COALESCE(dcl.ProductionLossTypeId, -1)
                  WHEN cd.CycleTime >= COALESCE(op.IdealValue, opnull.IdealValue) AND
                       COALESCE(op.MaxValue, opnull.MaxValue)
                  THEN dcl.ProductionLossTypeId
             END) AS Verdikt,
            (CASE WHEN cd.CycleTime >= COALESCE(op.IdealValue, opnull.IdealValue) AND
                       cd.CycleTime <= COALESCE(op.MaxValue, opnull.MaxValue)
                  THEN cd.CycleTime >= COALESCE(dcl.Time, cd.CycleTime - COALESCE(op.IdealValue, opnull.IdealValue))
                  WHEN cd.CycleTime >= COALESCE(op.IdealValue, opnull.IdealValue) AND
                       cd.CycleTime >= COALESCE(op.MaxValue, opnull.MaxValue)
                  THEN dcl.Time  
             END) AS Time
    FROM CycleData cd LEFT JOIN 
         DistributedCycleLosses dcl
         ON dcl.CycleDataId = cd.CycleDataId LEFT JOIN
         Operation_parameter op
         ON op.ProductId = cd.ProductId AND
            cd.Timestamp_ BETWEEN op.ValidFrom AND op.ValidTo LEFT JOIN
         Operation_parameter opnull
         ON op.ProductId IS NULL AND  -- no previous match
            opnull.ProductID IS NULL AND
            cd.Timestamp_ BETWEEN opnull.ValidFrom AND opnull.ValidTo ;
    

    Note that all references to op are replaced with COALESCE() expressions.

    You can modify this to handle multiple rows for matching NULL values, if that is really necessary. I think the more important part of the logic are the LEFT JOINs.