Search code examples
sqlgroup-bysplunksplunk-query

How do I group rows by time unless there is a significant gap?


Problem domain: I (well, the business) have a wifi network that accepts public connections. We want to know how long each device is staying connected to each of the access points (AP). This is known as 'dwell time'. The problem is complicated as a device can, and usually does, move around between APs during the day and will often come back to many of them more than once.

We currently use Splunk as our data capture and reporting tool and it does this auto-magically but we are considering a move to AWS and so will need to re-form everything using a combination of ETL and SQL.

I have data that looks like the below:

rowID clientMAC apMAC timeSeen
 100      1       a   12:01
 101      1       a   12:03
 102      1       a   12:05
 103      1       b   12:10
 104      1       b   12:20
 105      2       a   12:20
 106      2       a   12:22
 107      1       a   13:00
 108      1       a   13:02
 109      1       a   13:06
 110      1       a   13:12

My challenge is to report the duration of each example of clientAP+macAP, so for example, how long was clientMAC=1 connected to apMAC=a.

I can't take the final timeSeen from the initial timeSeen as clientMAC=1 connects to apMAC=b in the middle, so the result would include the time of that connection too.

The plain english logic of what I need to do is:

For each grouping of clientMAC and apMAC, determine the connection duration during the chosen time period. If there is a gap of, say 15 minutes between rows that are of the same combination, start a new duration calculation and close off the old one. Essentially, each set of a given clientMAC being seen at a given apMAC should be a seperate 'transaction' and reported as a single line.

So the desired output is something like:

clientMAC apMAC Duration
    1      a      ...
    1      b      ...
    2      a      ...
    1      a      ...

Solution

  • A version that doesn't use LAG(), and thus will work on older versions of SQL (LAG is SQL Server 2012 onwards), just in case. I have a lot of clients still using SQL Server 2008, and frequently need solutions that will work on older version, so someone else might need the same too!

    This example includes creating some test data, so you can see it working and the results

    -- Create a temp table to hold the test data
    CREATE TABLE #TestData
    (
        rowID INT NOT NULL PRIMARY KEY,
        clientMAC INT NOT NULL,
        apMAC VARCHAR(1) NOT NULL,
        timeSeen DATETIME NOT NULL
    )
    
    -- Create some test data
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (100, 1, 'a', '2019-Nov-01 12:01:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (101, 1, 'a', '2019-Nov-01 12:02:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (102, 1, 'a', '2019-Nov-01 12:05:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (103, 1, 'b', '2019-Nov-01 12:10:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (104, 1, 'b', '2019-Nov-01 12:20:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (105, 2, 'a', '2019-Nov-01 12:20:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (106, 2, 'a', '2019-Nov-01 12:22:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (107, 1, 'a', '2019-Nov-01 13:00:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (108, 1, 'a', '2019-Nov-01 13:02:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (109, 1, 'a', '2019-Nov-01 13:06:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (110, 1, 'a', '2019-Nov-01 13:12:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (111, 1, 'a', '2019-Nov-01 14:00:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (112, 1, 'a', '2019-Nov-01 14:12:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (113, 1, 'a', '2019-Nov-01 14:14:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (114, 1, 'a', '2019-Nov-01 14:30:00')
    INSERT INTO #TestData (rowID, clientMAC, apMAC, timeSeen) VALUES (115, 1, 'a', '2019-Nov-01 14:35:00')
    
    -- Start of Actual Code
    
    -- Store our maximum allowed gap in minuutes into a variable
    DECLARE @MaximumGapMinutes INT = 15
    
    -- Create temp table for primary calculated data
    CREATE TABLE #DwellTimes (
        rowID INT NOT NULL PRIMARY KEY,
        clientMAC INT NOT NULL,
        apMAC VARCHAR(1) NOT NULL,
        timeSeen DATETIME NOT NULL,
        lastSeen DATETIME NOT NULL,
        DwellTime INT NOT NULL
    )
    
    -- Populate temp table
    INSERT INTO #DwellTimes
    SELECT *, DateDiff(MINUTE, lastSeen, timeSeen) AS DwellTime
    FROM (
        SELECT *, IsNull((SELECT TOP 1 timeSeen 
                          FROM #TestData TDInner 
                          WHERE TDInner.clientMac = TDMain.clientMac AND TDInner.apMac = TDMain.apMac
                            AND TDInner.timeSeen < TDMain.timeSeen
                          ORDER BY timeSeen DESC
                        ), timeSeen) AS lastSeen
        FROM #TestData TDMain
    ) InnerTable
    
    -- Calculate the Dwell Time for visits, counting gaps longer than @MaximumGapMinutes as a new visit
    SELECT Min(timeSeen) AS StartTime, clientMac, apMac, 
           SUM(CASE WHEN DwellTime > @MaximumGapMinutes THEN 0 ELSE DwellTime END) AS DwellTime
    FROM (
    SELECT *, (SELECT COUNT(*) 
               FROM #DwellTimes DSub 
               WHERE DSub.clientMac = DMain.clientMac AND DSub.apMac = DMain.apMac
                 AND DSub.timeSeen <= DMain.timeSeen AND DSub.DwellTime > 15) AS GapNumber
    FROM #DwellTimes DMain
    ) InnerTable
    GROUP BY clientMac, apMac, GapNumber
    ORDER BY StartTime, clientMAC, apMAC, DwellTime
    
    -- Clean up after ourselves
    DROP TABLE #DwellTimes
    
    -- End of Actual Code
    
    -- Clean up after ourselves
    DROP TABLE #TestData
    

    Results:-

    Results

    An explanation of how it works.

    In the actual code itself, not the test data prep, the first thing we do is declare a variable for our maximum gap beween times to still be considered part of the same visit

    -- Store our maximum allowed gap in minuutes into a variable
    DECLARE @MaximumGapMinutes INT = 15
    

    We then create a temporary table to hold Dwell Time calculations, and populate it

    -- Populate temp table
    INSERT INTO #DwellTimes
    SELECT *, DateDiff(MINUTE, lastSeen, timeSeen) AS DwellTime
    FROM (
        SELECT *, IsNull((SELECT TOP 1 timeSeen 
                          FROM #TestData TDInner 
                          WHERE TDInner.clientMac = TDMain.clientMac AND TDInner.apMac = TDMain.apMac
                            AND TDInner.timeSeen < TDMain.timeSeen
                          ORDER BY timeSeen DESC
                        ), timeSeen) AS lastSeen
        FROM #TestData TDMain
    ) InnerTable
    

    The inner select finds the previous timeSeen for that clientMac and apMac. If there is no previous time seen, then it uses the current timeSeen (the IsNull(subselect, timeSeen) section).

    The outer select then calculates the Dwell Time between the current timeSeen and the previous one (lastSeen) for the same clientMac and apMac. Because we use the current timeSeen as lastSeen if there is no previous visit, the Dwell Time will be zero if this is the first visit.

    Results stored in #DwellTimes

    Reults stored in #DwellTimes

    Finally, we calculate the actual visits and Dwell Times, taking a gap longer than our maximum as a new visit.

    -- Calculate the Dwell Time for visits, counting gaps longer than @MaximumGapMinutes as a new visit
    SELECT Min(timeSeen) AS StartTime, clientMac, apMac, 
           SUM(CASE WHEN DwellTime > @MaximumGapMinutes THEN 0 ELSE DwellTime END) AS DwellTime
    FROM (
    SELECT *, (SELECT COUNT(*) 
               FROM #DwellTimes DSub 
               WHERE DSub.clientMac = DMain.clientMac AND DSub.apMac = DMain.apMac
                 AND DSub.timeSeen <= DMain.timeSeen AND DSub.DwellTime > 15) AS GapNumber
    FROM #DwellTimes DMain
    ) InnerTable
    GROUP BY clientMac, apMac, GapNumber
    ORDER BY StartTime, clientMAC, apMAC, DwellTime
    

    The inner select here adds a field GapNumber, which we can use as a grouping field. This is simply a count of how many previous records exceeded our maximum, including the current record. Thus if the current record exceeds the maximum, it is the start of a new visit.

    Results with Gap Number

    Gap Number

    And finally, grouping by clientMac, apMac and GapNumber allows us to use the sum of DwellTime as the Dwell Time per Visit, provided we set GapNumber to 0 if it's larger than our maximum, as that will be the start of the visit

    SUM(CASE WHEN DwellTime > @MaximumGapMinutes THEN 0 ELSE DwellTime END) AS DwellTime
    

    Results

    Hope this is useful to someone!