Tags: sql, sql-server, apache-spark-sql, azure-databricks

Error in Spark SQL: AnalysisException: Correlated scalar subquery


I am creating a Spark DataFrame using the query below:

    select distinct
        adr.ProductionDate,
        adr.CostCenterKey,
        adr.AEMainCategoryKey,
        sum(adr.DurationDayFrac * adr.FixedCashCostAE) * 100 /
            (select sum(adr_sub.DurationDayFrac * adr_sub.FixedCashCostAE)
             from hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting adr_sub
             where adr_sub.costcenterkey = adr.costcenterkey
               and adr_sub.ProductionDate = adr.ProductionDate) as AELossMagnitude
    from hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting adr
    where adr.DurationDayFrac != 0
      and adr.ProductionDate <= Current_Date()
      and adr.ProductionDate >= '2017-01-01'
      and adr.SiteName in ('Ludwigshafen','Schwarzheide','Antwerpen')
      and AreaIsCurrent = 'true'
    group by adr.ProductionDate, adr.CostCenterKey, adr.AEMainCategoryKey
    order by adr.ProductionDate, adr.CostCenterKey, adr.AEMainCategoryKey

But I am getting the following error:

AnalysisException: Correlated scalar subquery 'scalarsubquery(adr.costcenterkey, adr.ProductionDate)' is neither present in the group by, nor in an aggregate function. Add it to group by using ordinal position or wrap it in first() (or first_value) if you don't care which value you get.;

Could you please help me with the correct syntax?

I am expecting the correct syntax to remove the error. The message asks me to add the subquery to the GROUP BY or wrap it in first(), which does not seem possible here.
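
For reference, this is roughly the rewrite the error message seems to be asking for, with the whole correlated subquery wrapped in first(). I am not sure Spark actually accepts a correlated scalar subquery inside an aggregate function, so it may still fail; the WHERE filters are left out only to keep the sketch short:

    -- Sketch only: wrap the correlated scalar subquery in first() so it
    -- counts as an aggregate expression, as the error message suggests.
    select
        adr.ProductionDate,
        adr.CostCenterKey,
        adr.AEMainCategoryKey,
        sum(adr.DurationDayFrac * adr.FixedCashCostAE) * 100 /
            first((select sum(adr_sub.DurationDayFrac * adr_sub.FixedCashCostAE)
                   from hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting adr_sub
                   where adr_sub.costcenterkey = adr.costcenterkey
                     and adr_sub.ProductionDate = adr.ProductionDate)) as AELossMagnitude
    from hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting adr
    group by adr.ProductionDate, adr.CostCenterKey, adr.AEMainCategoryKey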


Solution

  • First, I don't understand why you need the DISTINCT keyword, since you already group by the same columns. The correlated subquery will not work here; use a common table expression (CTE) instead. Please see the code below.

    --
    --  Use a common table expression instead of a correlated subquery
    --
    
    WITH cte_cost_by_center_n_date AS
    (
    SELECT
       costcenterkey,
       ProductionDate,
       sum(DurationDayFrac * FixedCashCostAE) as TotalCost
    FROM
        hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting 
    GROUP BY
       costcenterkey,
       ProductionDate
    )
    
    SELECT
        adr.ProductionDate,
        adr.CostCenterKey,
        adr.AEMainCategoryKey, 
        -- TotalCost is constant within each group (it is determined by
        -- CostCenterKey and ProductionDate), so wrap it in max() to satisfy
        -- Spark's rule that selected columns must be grouped or aggregated
        sum(adr.DurationDayFrac * adr.FixedCashCostAE) * 100 / max(adr_sub.TotalCost) as AELossMagnitude
    FROM 
        hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting as adr 
    JOIN
        cte_cost_by_center_n_date as adr_sub
    ON 
        adr_sub.costcenterkey = adr.costcenterkey 
        AND adr_sub.ProductionDate = adr.ProductionDate
    WHERE 
        adr.DurationDayFrac != 0 
        AND adr.ProductionDate <= Current_Date() 
        AND adr.ProductionDate >= '2017-01-01'
        AND adr.SiteName in ('Ludwigshafen','Schwarzheide','Antwerpen') 
        AND adr.AreaIsCurrent = 'true' 
    GROUP BY
        adr.ProductionDate,
        adr.CostCenterKey,
        adr.AEMainCategoryKey
    ORDER BY 
        adr.ProductionDate,
        adr.CostCenterKey,
        adr.AEMainCategoryKey
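
  • Alternatively, you can drop the join and compute the denominator with a window function. This is only a sketch, and note that it is not exactly equivalent: here the total is taken over the rows that pass the WHERE filter, whereas your correlated subquery (and the CTE above) totals every row for the cost center and date. CategoryCost and the per_category alias are just names I made up for the sketch.

    SELECT
        ProductionDate,
        CostCenterKey,
        AEMainCategoryKey,
        -- each category's share of the filtered total for its cost center and date
        CategoryCost * 100 /
            sum(CategoryCost) OVER (PARTITION BY CostCenterKey, ProductionDate) as AELossMagnitude
    FROM
    (
        SELECT
            adr.ProductionDate,
            adr.CostCenterKey,
            adr.AEMainCategoryKey,
            sum(adr.DurationDayFrac * adr.FixedCashCostAE) as CategoryCost
        FROM
            hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting as adr
        WHERE
            adr.DurationDayFrac != 0
            AND adr.ProductionDate <= Current_Date()
            AND adr.ProductionDate >= '2017-01-01'
            AND adr.SiteName in ('Ludwigshafen','Schwarzheide','Antwerpen')
            AND adr.AreaIsCurrent = 'true'
        GROUP BY
            adr.ProductionDate,
            adr.CostCenterKey,
            adr.AEMainCategoryKey
    ) as per_category
    ORDER BY
        ProductionDate,
        CostCenterKey,
        AEMainCategoryKey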