Search code examples
mysqlsqlsequencedata-cleaningdata-processing

Collapsing SQL rows into one row based on match of sequential row pattern


I'm working on a project that involves combining a series of actions into sequences. The specific programming question is that I want to figure out how to use SQL (preferably mySQL) to combine multiple rows in a table when these rows match a certain sequence. One caveat is that the table is grouped by two other columns (who did it and what day they did it).

For this example, we are working with a list of actions for people's morning routines. There are two sequences of actions that we want to combine our action list into: wakeup/snoozed alarm and started day. A wakeup/snoozed alarm sequence is a (wakeup row) -> (snooze alarm) row. A started day sequence is a (wakeup row) -> (ate breakfast row) -> (brushed teeth row).

Our raw actions table looks like this: A series of actions for different participants on a different day

Our desired output looks like this: A similar action table who's rows matching a certain sequence have been collapsed into one 'sequence' row.

So far, I've looked into window functions and iteration using mySQL, but both of those don't seem to scale to the fact that I have multiple sequences to match on.

I feel like this might not be possible in SQL, but I thought that I'd post this here in case someone else would know how to tackle this data processing question. The end goal is to have these results stored in a view so whenever we query the actions table in the future, it queries the "processed" actions (ie actions after they've been grouped into sequences).


Solution

  • If I understand correctly, you can use lead() and then some filtering logic. First, assign the new actions:

    select t.*,
           (case when (actionid, next_actionid) = ('wokeUp', 'snoozedAlarm')
                 then 'wokeup and snoozed'
                 when (actionid, next_actionid, next2_actionid) = ('wokeUp', 'ateBreakfast', 'brushedTeeth')
                 then 'started day'
                 else actionid
             end) as action,
    from (select t.*,
                 lead(actionId, 1) over (partition by person, day order by id) as next_actionid,
                 lead(actionId, 2) over (partition by person, day order by id) as next2_actionid
          from t
         ) t;
    

    Next, use this information for filtering:

    with newactions as (
          select t.*,
                 (case when (actionid, next_actionid) = ('wokeUp', 'snoozedAlarm')
                       then 'wokeup and snoozed'
                       when (actionid, next_actionid, next2_actionid) = ('wokeUp', 'ateBreakfast', 'brushedTeeth')
                       then 'started day'
                       else actionid
                   end) as action,
                 (case when (actionid, next_actionid) = ('wokeUp', 'snoozedAlarm')
                       then 2
                       when (actionid, next_actionid, next2_actionid) = ('wokeUp', 'ateBreakfast', 'brushedTeeth')
                       then 1
                       else 0
                   end) as duration
          from (select t.*,
                       lead(actionId, 1) over (partition by person, day order by id) as next_actionid,
                       lead(actionId, 2) over (partition by person, day order by id) as next2_actionid
                from t
               ) t
          )
        select na.*
        from (select na.*,
                     lag(duration) over (partition by person order by id) as prev_duration,
                     lag(duration) over (partition by person order by id) as prev2_duration
              from newactions na
             ) na
        where not (prev_duration >= 1 or
                   prev2_duration >= 2
                  )