Search code examples
pythonsqlpandassnowflake-cloud-data-platform

Converting python logic to sql query (Pairing two status from one column)


I need help with converting my python code to SQL:

req_id_mem = ""
req_workflow_mem = ""

collect_state_main = []
collect_state_temp = []

for req_id, req_datetime, req_workflow in zip(df["TICKET_ID"], df["DATETIMESTANDARD"], df["STATUS"]):
    if req_id_mem == "" or req_id_mem != req_id:
        req_id_mem = req_id
        req_workflow_mem = ""
        collect_state_temp = []
    
    if req_workflow_mem == "" and req_workflow == "Open" and req_id_mem == req_id:
        req_workflow_mem = req_workflow
        collect_state_temp.append(req_id)
        collect_state_temp.append(req_workflow)
        collect_state_temp.append(req_datetime)
        
    if req_workflow_mem == "Open" and req_workflow == "Closed" and req_id_mem == req_id:
        req_workflow_mem = req_workflow
        collect_state_temp.append(req_workflow)
        collect_state_temp.append(req_datetime)
        collect_state_main.append(collect_state_temp)
        collect_state_temp = []

DataFrame:

TICKET_ID DATETIMESTANDARD STATUS
79355138 9/3/2024 11:54:18 AM Open
79355138 9/3/2024 9:01:12 PM Open
79355138 9/6/2024 4:52:10 PM Closed
79355138 9/6/2024 4:52:12 PM Open
79355138 9/10/2024 4:01:24 PM Closed
79446344 8/27/2024 1:32:54 PM Open
79446344 9/11/2024 9:40:17 AM Closed
79446344 9/11/2024 9:40:24 AM Closed
79446344 9/11/2024 9:42:14 AM Open

Result:

  1. It will Identify the first Open State of a TICKET_ID and look for the closest Closed Status
  2. It will reiterate for each case to look for an Open and Closed pair (first open and first close will only be considered)

My problem is I'm stuck since the pairings can happen more than twice. I tried Rank in sql but it only return the first instance of pairing but not the other pairs

Adding also my solution to this one as I migrated to snowflake recently:

SELECT
    FOD.TICKET_ID,
    FOD.FIRSTOPENDATETIME AS OPEN_DATETIME,
    MIN(NC.DATETIMESTANDARD) AS CLOSED_DATETIME
FROM
    (
    SELECT
        TICKET_ID,
        MIN(DATETIMESTANDARD) AS FIRSTOPENDATETIME,
        STATUS
    FROM
        DB.TABLE
    WHERE
        (
        (STATUS IN ('Open') AND EVENT_TYPE IN ('Ticket Open'))
        OR STATUS IN ('Closed')
        )
    GROUP BY
        TICKET_ID, STATUS
    ) AS FOD
LEFT JOIN DB.TABLE AS NC ON FOD.TICKET_ID = NC.TICKET_ID AND NC.STATUS = 'Closed' AND NC.DATETIMESTANDARD > FOD.FIRSTOPENDATETIME
WHERE
FOD.STATUS = 'Open'
GROUP BY
FOD.TICKET_ID, FOD.FIRSTOPENDATETIME
ORDER BY
FOD.TICKET_ID ASC, FOD.FIRSTOPENDATETIME ASC

Solution

  • One way of doing it:

    SELECT 
      FOD.TICKET_ID
      , FOD.FIRSTOPENDATETIME
      , (SELECT NC.DATETIMESTANDARD 
         from MyTbl NC -- Nearest future close date
         where NC.TICKET_ID=FOD.TICKET_ID
           and NC.STUS='Closed'
           and NC.DATETIMESTANDARD>FOD.FIRSTOPENDATETIME
           and DATETIMESTANDARD=(select min(DATETIMESTANDARD)
                                 from MyTbl NCb
                                 where NCb.TICKET_ID=NC.TICKET_ID
                                   and NCb.STUS='Closed'
                                   and NCb.DATETIMESTANDARD > FOD.FIRSTOPENDATETIME)
          ) as NearestFutureClosedDate
    from  (select  TICKET_ID
                 , MIN(DATETIMESTANDARD) as FIRSTOPENDATETIME
          from MyTbl
          group by TICKET_ID) as FOD
    

    The main query selects the earliest Open rows for each ticket, and the subquery finds the earliest Closed rows following those.

    Update: I just realised that you only want the date (and no other columns) from the nearest close date row), so it can be simplifed to:

    SELECT 
      FOD.TICKET_ID
      , FOD.FIRSTOPENDATETIME
      , (select min(DATETIMESTANDARD)
         from #MyTbl NC
         where FOD.TICKET_ID=NC.TICKET_ID
         and NC.STUS='Closed'
         and NC.DATETIMESTANDARD > FOD.FIRSTOPENDATETIME
          ) as NearestFutureClosedDate
    from  (select  TICKET_ID
                 , MIN(DATETIMESTANDARD) as FIRSTOPENDATETIME
          from #MyTbl
          group by TICKET_ID) as FOD
    
    ;
    

    You can also use CTEs but it doesn't mean that it will change the execution plan. Just to prove the point, I moved the subquery obtaining 'first open date' into a CTE, and compared the execution plans:

    with FOD as (
         select  TICKET_ID
                 , MIN(DATETIMESTANDARD) as FIRSTOPENDATETIME
          from #MyTbl
          group by TICKET_ID
    )
    SELECT 
      FOD.TICKET_ID
      , FOD.FIRSTOPENDATETIME
      , (select min(DATETIMESTANDARD)
         from #MyTbl NC
         where FOD.TICKET_ID=NC.TICKET_ID
         and NC.STUS='Closed'
         and NC.DATETIMESTANDARD > FOD.FIRSTOPENDATETIME
          ) as NearestFutureClosedDate
    from  FOD
    

    Comparative plans

    These are exactly the same for this example; we can have different outcomes depending on the data statistics, rdbms engine, system resources, etc, but it proves that defining some part of your query as a 'subquery' doesn't instruct the engine to 'create a table'.