Search code examples
sqljsonsnowflake-cloud-data-platformlagflatten

How to Get First_Value(), Last_Value() and previous Date action for an Array object inside a VARIANT column SnowflakeSQL


I have a VARIANT column call 'REQUEST' in the table 'QWERTY' that contains an Array object inside a JSON like

{
"ID": "123123",
"workflowHistory": [
                   {
                    "id": "666",
                    "workflowType": "CCC",
                    "entityId": "123123",
                    "creator": {
                        "id": "503081",
                        "displayName": "AGENT2",
                        "email": "AGENT2@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:58:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                    "id": "555",
                    "workflowType": "AAA",
                    "entityId": "123123",
                    "creator": {
                        "id": "503080",
                        "displayName": "AGENT1",
                        "email": "AGENT1@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:55:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                   "id": "444",
                    "workflowType": "xyz",
                    "entityId": "123123",
                    "creator": {
                        "id": "503080",
                        "displayName": "AGENT1",
                        "email": "AGENT1@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:19:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                   "id": "333",
                    "workflowType": "BBB",
                    "entityId": "123123",
                    "creator": {
                        "id": "503079",
                        "displayName": "AGENT0",
                        "email": "AGENT0@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:10:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                },
                {
                   "id": "222",
                    "workflowType": "ZZZ",
                    "entityId": "123123",
                    "creator": {
                        "id": "503079",
                        "displayName": "AGENT0",
                        "email": "AGENT0@SOMETHING.com",
                        "userAvatarUrl": "XXXXXXX"
                    },
                    "createdDate": "2020-04-30T21:08:09Z",
                    "deletor": null,
                    "deletedDate": null,
                    "clientId": "000000000",
                    "value": "00000000"
                }
                    ]
}

Also, 'QWERTY' table has HAVERST_DATE and the PK ARTICLE_ID (the same as REQUEST:workflowHistory.ID), I am trying to get an output with the following columns:

  1. ID
  2. Last createdDate for an AGENTn
  3. First createdDate for an AGENTn
  4. the previous createdDate that is made BY AGENTn-1
  5. the next createdDate that is made BY AGENTn+1

I would like an output like:

OUTPUT

For this I'm building A query as follows:

WITH WorkFlow_Parsed AS(

SELECT ARTICLE_ID,
       HARVEST_DATE,
       value:createdDate::timestamp_tz  AS create_date,                
       value:creator:email AS email,
       value:workflowType AS  workflowType,
       value:value AS value
      
FROM 'QWERTY', lateral flatten( input => REQUEST:workflowHistory )
),


lag_Agent_timing AS 
(SELECT
WorkFlow_Parsed.ARTICLE_ID AS ARTICLE_ID,WorkFlow_Parsed.email,LAG(WorkFlow_Parsed.create_date) IGNORE NULLS over (partition by  WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS lag_date_value
FROM  WorkFlow_Parsed),

lead_agent_timing AS
(SELECT
WorkFlow_Parsed.ARTICLE_ID AS ARTICLE_ID,WorkFlow_Parsed.email,LEAD(WorkFlow_Parsed.create_date) IGNORE NULLS over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date)  AS lead_date_value
FROM  WorkFlow_Parsed)


SELECT 
DISTINCT 
WorkFlow_Parsed.ARTICLE_ID AS _ARTICLE_ID,
WorkFlow_Parsed.email AS _email,
last_value(WorkFlow_Parsed.create_date) over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS last_date_value,
first_value(WorkFlow_Parsed.create_date) over (partition by WorkFlow_Parsed.email,WorkFlow_Parsed.ARTICLE_ID order by WorkFlow_Parsed.create_date) AS first_date_value,
MAX(lag_Agent_timing.lag_date_value),
MIN(lead_agent_timing.lead_date_value)
FROM  WorkFlow_Parsed
JOIN lag_Agent_timing ON WorkFlow_Parsed.ARTICLE_ID=lag_Agent_timing.ARTICLE_ID AND lag_Agent_timing.email=WorkFlow_Parsed.email
JOIN lead_agent_timing ON WorkFlow_Parsed.ARTICLE_ID=lead_agent_timing.ARTICLE_ID AND lead_agent_timing.email=WorkFlow_Parsed.email  
GROUP BY _ARTICLE_ID,_email

But I Got the error: "[SYS_VW.CREATE_DATE_1] is not a valid group by expression"`

How could I Fix it?


Solution

  • [SYS_VW.CREATE_DATE_1] is not a valid group by expression

    The error is coming from your use of GROUP BY in the final SELECT query. It is pointing out that you are referencing/using Workflow_Parsed.create_date in the query as a non-group column but it isn't part of the GROUP BY _ARTICLE_ID, _email expression, i.e. it is the same as [Workflow_Parsed.create_date] is not a valid group by expression that you will receive if you simplify the query a bit.

    Snowflake does not permit aggregating over a window function expression and if you'd like to mix a GROUP BY with a window function, try nesting the query in an structure such as SELECT cols, aggregate(cols) FROM (SELECT cols, window(cols)) GROUP BY cols to separate the two (i.e. apply window functions over all rows first, then group the entire result it produces).

    I'm unsure what the window functions are attempting in your sample query because I do not see the agent's n ± 1 relations anywhere in them, but going by your described requirement and the sample output included, the following should work (it just uses scalar subqueries, no window functions):

    WITH workflows AS (
      SELECT PARSE_JSON('{"ID":"123123","workflowHistory":[{"id":"666","workflowType":"CCC","entityId":"123123","creator":{"id":"503081","displayName":"AGENT2","email":"AGENT2@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:58:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"555","workflowType":"AAA","entityId":"123123","creator":{"id":"503080","displayName":"AGENT1","email":"AGENT1@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:55:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"444","workflowType":"xyz","entityId":"123123","creator":{"id":"503080","displayName":"AGENT1","email":"AGENT1@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:19:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"333","workflowType":"BBB","entityId":"123123","creator":{"id":"503079","displayName":"AGENT0","email":"AGENT0@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:10:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"},{"id":"222","workflowType":"ZZZ","entityId":"123123","creator":{"id":"503079","displayName":"AGENT0","email":"AGENT0@SOMETHING.com","userAvatarUrl":"XXXXXXX"},"createdDate":"2020-04-30T21:08:09Z","deletor":null,"deletedDate":null,"clientId":"000000000","value":"00000000"}]}') AS request
    ), workflow_rows AS (
      SELECT
        w.request:ID::varchar AS article_id,        
        lf.value:createdDate::timestamp_tz  AS created_date,
        lf.value:creator.id::integer AS creator_id,
        lf.value:creator.email::varchar AS creator_email,
        lf.value:workflowType::varchar AS workflow_type,
        lf.value:value::varchar AS workflow_value
      FROM workflows w, LATERAL FLATTEN(REQUEST:workflowHistory) lf
    ), article_workflow_creators AS (
      SELECT DISTINCT
        article_id,
        creator_id,
        creator_email
      FROM workflow_rows
    )
    SELECT
        awc.article_id,
        awc.creator_id,
        awc.creator_email,
        (SELECT MAX(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id) AS last_date_value,
        (SELECT MIN(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id) AS first_date_value,
        (SELECT MAX(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id - 1) AS previous_date,
        (SELECT MAX(wr.created_date) FROM workflow_rows wr WHERE wr.article_id = awc.article_id AND wr.creator_id = awc.creator_id + 1) AS next_date
    FROM article_workflow_creators awc;
    

    For the single JSON row input included in the question, this produces:

    +------------+------------+----------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+
    | ARTICLE_ID | CREATOR_ID | CREATOR_EMAIL        | LAST_DATE_VALUE               | FIRST_DATE_VALUE              | PREVIOUS_DATE                 | NEXT_DATE                     |
    |------------+------------+----------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------|
    | 123123     |     503081 | AGENT2@SOMETHING.com | 2020-04-30 21:58:09.000 +0000 | 2020-04-30 21:58:09.000 +0000 | 2020-04-30 21:55:09.000 +0000 | NULL                          |
    | 123123     |     503080 | AGENT1@SOMETHING.com | 2020-04-30 21:55:09.000 +0000 | 2020-04-30 21:19:09.000 +0000 | 2020-04-30 21:10:09.000 +0000 | 2020-04-30 21:58:09.000 +0000 |
    | 123123     |     503079 | AGENT0@SOMETHING.com | 2020-04-30 21:10:09.000 +0000 | 2020-04-30 21:08:09.000 +0000 | NULL                          | 2020-04-30 21:55:09.000 +0000 |
    +------------+------------+----------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+