I am trying to retrieve data from Athena with the following query:
SELECT DISTINCT cop.shop_id,
cop.product_id,
avg(cop.position) AS avg_position,
cp.kes
FROM data_1 AS cop
JOIN data_2 AS cp
ON cop.product_id = cp.product_id
WHERE cop.site_id = 1
AND cop.product_id IS NOT NULL
GROUP BY cop.shop_id, cop.product_id, cp.kes
However, there are four columns in the data: product_id, shop_id, avg_position, kes.
Some rows come with both NA and non-NA in the kes
column. I simply want to manipulate the data with the following condition:
product_id
, shop_id
, and avg_position
are the same, and if there are NA and Non-NA in the kes
column, just leave the Non-NA row and remove the rows which include NA in the kes
. kes
, don't remove it. How can I do this?
I think that you want aggregation on kes
:
SELECT
cop.shop_id,
cop.product_id,
AVG(cop.position) AS avg_position,
MAX(cp.kes) kes
FROM data_1 AS cop
JOIN data_2 AS cp
ON cop.product_id = cp.product_id
WHERE cop.site_id = 1 AND cop.product_id IS NOT NULL
GROUP BY cop.shop_id, cop.product_id
Aggregate functions ignore null
values; so MAX(cp.kes)
gives you the highest non-null
value of cp.kes
. On the other hand, if all values of cp.kes
are null
within the group, max()
gives null
.
Side note: DISTINCT
and GROUP BY
do not make sense together (although that's still valid SQL); GROUP BY
guarantees no duplicate values in the SELECT
list already.