I am creating a spark dataframe using the below query:
select distinct adr.ProductionDate,adr.CostCenterKey,adr.AEMainCategoryKey, Sum(adr.DurationDayFrac*adr.FixedCashCostAE)*100 / (select sum(adr_sub.DurationDayFrac*adr_sub.FixedCashCostAE)
from hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting adr_sub
where adr_sub.costcenterkey=adr.costcenterkey and adr_sub.ProductionDate=adr.ProductionDate
) as AELossMagnitude
FROM hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting adr
WHERE adr.DurationDayFrac != 0 AND adr.ProductionDate <= Current_Date()
AND adr.ProductionDate >= '2017-01-01'
AND adr.SiteName in ('Ludwigshafen','Schwarzheide','Antwerpen')
AND AreaIsCurrent = 'true'
group by adr.ProductionDate,adr.CostCenterKey,adr.AEMainCategoryKey
order by adr.ProductionDate,adr.CostCenterKey,adr.AEMainCategoryKey
But i am getting the below error:
AnalysisException: Correlated scalar subquery 'scalarsubquery(adr.costcenterkey, adr.ProductionDate)' is neither present in the group by, nor in an aggregate function. Add it to group by using ordinal position or wrap it in first() (or first_value) if you don't care which value you get.;
Request you to please help me with the correct syntax.
I am expecting the correct syntax to remove the error.z Its asking for group by in the sub query which is not possible.
First, I don't understand why you need the distinct key word. I think the correlated sub-query will not work. Use a common table expression instead. Please see code below.
--
-- Use common table expression instead of correlated sub query
--
WITH cte_cost_by_center_n_date
(
SELECT
costcenterkey,
ProductionDate,
sum(DurationDayFrac * FixedCashCostAE) as TotalCost
FROM
hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting
GROUP BY
costcenterkey,
ProductionDate
)
SELECT
adr.ProductionDate,
adr.CostCenterKey,
adr.AEMainCategoryKey,
sum(adr.DurationDayFrac*adr.FixedCashCostAE) * 100 / adr_sub.TotalCost as AELossMagnitude
FROM
hive_metastore.oth_opsdbconf.operations_db___asset_effectiveness_daily_reporting as adr
JOIN
cte_cost_by_center_n_date as adr_sub
ON
adr_sub.costcenterkey = adr.costcenterkey
AND adr_sub.ProductionDate = adr.ProductionDate
WHERE
adr.DurationDayFrac != 0
AND adr.ProductionDate <= Current_Date()
AND adr.ProductionDate >= '2017-01-01'
AND adr.SiteName in ('Ludwigshafen','Schwarzheide','Antwerpen')
AND adr.AreaIsCurrent = 'true'
GROUP BY
adr.ProductionDate,
adr.CostCenterKey,
adr.AEMainCategoryKey
ORDER BY
adr.ProductionDate,
adr.CostCenterKey,
adr.AEMainCategoryKey