I am facing a problem while getting the cumulative distinct count of resource ids as of different modified dates in vertica. If you see the below table I have resource id, modified date and deleted date and I want to calculate the count of distinct active resources as of all unique modified dates. A resource is considered active when deleted date is null as of/before that modified date.
I was able to get the count when for a particular resource lets say resource id 1 the active count(deleted date null) or inactive count(deleted date not null) dont occur consecutively.
But when they occur consecutively I have to take the count as 1 till it becomes inactive and then I have to consider count as 0 for that resource id when it becomes inactive and all consecutive inactive values till it becomes active again. Likewise for all the distinct resource ids and cumulative sum of those.
sa_resource_id | modified_date | deleted_Date |
---|---|---|
1 | 2022-01-22 15:46:06.758 | |
2 | 2022-01-22 15:46:06.758 | |
16 | 2022-04-22 15:46:06.758 | |
17 | 2022-04-22 15:46:06.758 | |
18 | 2022-04-22 15:46:06.758 | |
16 | 2022-04-29 15:46:06.758 | 2022-04-29 15:46:06.758 |
17 | 2022-04-29 15:46:06.758 | 2022-04-29 15:46:06.758 |
1 | 2022-05-22 15:46:06.758 | 2022-05-22 15:46:06.758 |
2 | 2022-05-22 15:46:06.758 | 2022-05-22 15:46:06.758 |
1 | 2022-05-23 22:16:06.758 | |
1 | 2022-05-24 22:16:06.758 | 2022-05-24 22:16:06.758 |
1 | 2022-05-25 22:16:06.758 | |
1 | 2022-05-27 22:16:06.758 |
This is the partition and sum query I have tried out where I partition the table based on resource ids and do sum over different modified dates.
SELECT md,
dca_agent_count
FROM
(
SELECT modified_date AS md,
SUM(SUM(CASE WHEN deleted_Date IS NULL THEN 1
WHEN deleted_Date IS NOT NULL THEN -1 ELSE 0
END)) OVER (ORDER BY modified_date) AS dca_agent_count
FROM
(
SELECT sa_resource_id,
modified_date,
deleted_Date,
ROW_NUMBER() OVER (
PARTITION BY sa_Resource_id, deleted_Date
ORDER BY modified_date desc
) row_num
FROM mf_Shared_provider_Default.dca_entity_resource_raw
WHERE sa_ResourcE_id IS NOT NULL
AND sa_resource_id IN ('1','2','34','16','17','18')
) t
GROUP BY modified_date
ORDER BY modified_Date
) b
Current Output:
md | dca_agent_count |
---|---|
2022-01-22 15:46:06.758 | 2 |
2022-04-22 15:46:06.758 | 5 |
2022-04-29 15:46:06.758 | 3 |
2022-05-22 15:46:06.758 | 1 |
2022-05-23 22:16:06.758 | 2 |
2022-05-24 22:16:06.758 | 1 |
2022-05-25 22:16:06.758 | 2 |
2022-05-27 22:16:06.758 | 3 |
If you see the output above all the values are correct except for the last row 27-05-2022 where i need to get count 2 only instead of 3
How do I get the cumulative distinct count of sa resource ids as of the modified dates based on deleted date condition(null/not null) and count should not change when deleted date (null/not null) occur consecutively
To me, a DATE
has no hours, minutes, seconds, let alone second fractions, so I renamed the time containing attributes to %_ts
, as they are TIMESTAMP
s.
I had to completely start from scratch to solve it.
I think this is the first problem I had to solve with as much as 5 Common Table Expressions:
is_active
that is never NULLis_active
using LAG()
. NULL here means there is no predecessor for the same resource id.is_active
is equal to the current is_active
.UNION SELECT
the positive COUNT DISTINCT
s of the active rows and the negative COUNT DISTINCT
s of the inactive rows. This also removes the last timestamp.The final query takes CTE 5 and LEFT JOIN
s it with CTE 4, making a running sum of the obtained distinct counts.
Here goes:
WITH
-- not part of the final query: this is your input data
indata(sa_resource_id,modified_ts,deleted_ts) AS (
SELECT 1,TIMESTAMP '2022-01-22 15:46:06.758',NULL
UNION ALL SELECT 2,TIMESTAMP '2022-01-22 15:46:06.758',NULL
UNION ALL SELECT 16,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 17,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 18,TIMESTAMP '2022-04-22 15:46:06.758',NULL
UNION ALL SELECT 16,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
UNION ALL SELECT 17,TIMESTAMP '2022-04-29 15:46:06.758',TIMESTAMP '2022-04-29 15:46:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
UNION ALL SELECT 2,TIMESTAMP '2022-05-22 15:46:06.758',TIMESTAMP '2022-05-22 15:46:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-23 22:16:06.758',NULL
UNION ALL SELECT 1,TIMESTAMP '2022-05-24 22:16:06.758',TIMESTAMP '2022-05-24 22:16:06.758'
UNION ALL SELECT 1,TIMESTAMP '2022-05-25 22:16:06.758',NULL
UNION ALL SELECT 1,TIMESTAMP '2022-05-27 22:16:06.758',NULL
)
-- real query starts here, replace the following comma with "WITH" ...
,
-- need a "active flag" that is never null
w_active_flag AS (
SELECT
*
, (deleted_ts IS NULL) AS is_active
FROM indata
)
,
-- need current and previous is_active to filter ..
w_prev_flag AS (
SELECT
*
, LAG(is_active) OVER w AS prev_flag
FROM w_active_flag
WINDOW w AS(PARTITION BY sa_resource_id ORDER BY modified_ts)
)
,
-- use obtained filter arguments to filter out two consecutive
-- active or non-active rows for same sa_resource_id
-- this can remove timestamps from the final result
de_duped AS (
SELECT
sa_resource_id
, modified_ts
, is_active
FROM w_prev_flag
WHERE prev_flag IS NULL OR prev_flag <> is_active
)
-- get count distinct "sa_resource_id" only now
,
grp AS (
SELECT
modified_ts
, COUNT(DISTINCT sa_resource_id) AS dca_agent_count
FROM de_duped
WHERE is_active
GROUP BY modified_ts
UNION ALL
SELECT
modified_ts
, COUNT(DISTINCT sa_resource_id) * -1 AS dca_agent_count
FROM de_duped
WHERE NOT is_active
GROUP BY modified_ts
)
,
-- get back all input timestamps in a help table
tslist AS (
SELECT DISTINCT
modified_ts
FROM indata
)
SELECT
tslist.modified_ts
, SUM(NVL(dca_agent_count,0)) OVER w AS dca_agent_count
FROM tslist LEFT JOIN grp USING(modified_ts)
WINDOW w AS (ORDER BY tslist.modified_ts);
-- out modified_ts | dca_agent_count
-- out -------------------------+-----------------
-- out 2022-01-22 15:46:06.758 | 2
-- out 2022-04-22 15:46:06.758 | 5
-- out 2022-04-29 15:46:06.758 | 3
-- out 2022-05-22 15:46:06.758 | 1
-- out 2022-05-23 22:16:06.758 | 2
-- out 2022-05-24 22:16:06.758 | 1
-- out 2022-05-25 22:16:06.758 | 2
-- out 2022-05-27 22:16:06.758 | 2