Search code examples
sqlt-sqlhaproxyreferersql-server-2016

How to use SQL to attribute events to visit source in traffic logs?


I'm trying to attribute events to source in a traffic log. This seems like fairly basic territory for any semi-savvy technical marketer or webmaster, but I've Googled around a bit, and surprisingly, nobody seems to have covered this.

Background: I'm using T-SQL and we're running on SQL Server 2016. My event logs (HAProxy) look something like this, where RefererHost is '%mysite%' for internal referrals, but can be anything else for external referrals (entrances).

User Agent ||     IP     ||  RefererHost ||     Event     ||   CreationDate
------------------------------------------------------------------------------
qwertyuiop || 99.99.99.9 ||   google.com ||   Home/View   || 2015-05-29 00:00:25
------------------------------------------------------------------------------
qwertyuiop || 99.99.99.9 ||   mysite/x   ||   Home/View   || 2015-05-29 00:00:27
------------------------------------------------------------------------------
abcdefghij || 11.11.11.1 ||   yahoo.com  ||   Home/View   || 2015-05-29 00:00:49
------------------------------------------------------------------------------
qwertyuiop || 99.99.99.9 ||   mysite/y   ||     Submit    || 2015-05-29 00:01:28
------------------------------------------------------------------------------
abcdefghij || 11.11.11.1 ||   mysite/p   ||   Photo/View  || 2015-05-29 00:02:04
------------------------------------------------------------------------------
abcdefghij || 11.11.11.1 ||   mysite/n   ||     Submit    || 2015-05-29 00:02:09

The goal: I'm trying to associate all Submit events with their entrance. Aggregated, the results would look something like this:

RefererHost || SubmitCount  ||
------------------------------
google.com  ||      1       || 
yahoo.com   ||      1       || 

Complicating factors: But this is a very simple example. In fact, individual users can visit multiple times per period, and they can Submit more than once per visit (session). Also, users can go idle for long periods of time after entrance: a Submit can occur hours after an entrance.

So I think what I'm trying to do is select for all CreationDates where Event = Submit, and Users (IP + UA checksum), and then find the nearest previous Event where = RefererHost is not '%mysite%', and store it... somewhere associated with that Submit event. Then I can do a count of Submit events, grouped by RefererHost to get what I'm looking for.

This method makes some sense to me, but I don't know how to write a query that "looks back" to find the nearest previous referer. Also, I'm not sure if SQL alone can handle this without the operation timing out. And I'm not sure if I'm missing an edge case. Has anyone done anything like this before?


Solution

  • If you're using a database with window functions, you can do this with a reasonably short query. You can also see a working example of this query (with some dummy data), if you'd like to tinker with this on live data: https://modeanalytics.com/benn/reports/9f72b24dce58/query

    Each step in this is broken out as a common table expression. While this makes it easier to describe, the query could be written as series of subqueries if that style's more your thing.

    Step 1: I made your table.

    WITH event_table AS (
        SELECT user_id AS dummy_ip,
               occurred_at,
               location AS dummy_referer,
               event_name
          FROM tutorial.playbook_events 
    )
    

    The example data I had didn't map exactly to your example, but this creates a table that roughly does. I mapped user_id to ip_address since those two fields are conceptually the same. location and referer have absolutely nothing to do with each other, but they're both event attributes associated with every event. And I had a location field in my data, so I went with it. Think of it like a physical referer or something, I guess.

    Step 2: Determine the time since the last event.

    with_last_event AS (
        SELECT *,
               LAG(occurred_at,1) OVER (PARTITION BY dummy_ip ORDER BY occurred_at) AS last_event
          FROM event_table
    )
    

    The LAG function here finds the time of the last event at that IP. If there was no last event, it's null.

    Step 3: Find which events mark the beginning of a new session.

    with_new_session_flag AS (
        SELECT *,
               CASE WHEN EXTRACT('EPOCH' FROM occurred_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10) OR last_event IS NULL 
                    THEN 1 ELSE 0 END AS is_new_session,
               CASE WHEN EXTRACT('EPOCH' FROM occurred_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10) OR last_event IS NULL 
                    THEN dummy_referer ELSE NULL END AS first_referer
          FROM with_last_event
    )
    

    Most platforms define new sessions as an action after a period of inactivity. The first case statement does that by looking for how long it's been since the previous event. If it's longer than the time you choose (in this case, 60 seconds * 10, so 10 minutes), then that event is flagged as the first one in a new session. It's flagged with a 1; non-first events are marked with a 0.

    The second case statement finds the same event, but rather than marking that event with a 1 to flag it as a new session, it returns the referer. If it's not a new session, it returns null.

    Step 4: Create session ids.

    with_session_ids AS (
        SELECT *,
               SUM(is_new_session) OVER (ORDER BY dummy_ip, occurred_at) AS global_session_id,
               SUM(is_new_session) OVER (PARTITION BY dummy_ip ORDER BY occurred_at) AS user_session_id
          FROM with_new_session_flag
    )
    

    These window functions produce a running total of the session flags (the column that's 1 when it's a new session and 0 when it's not). The result is a column that stays the same when a session doesn't change, and increments by 1 every time a new session starts. Depending on how you partition and order this window function, you can create sessions ids that are unique to that user and unique globally.

    Step 5: Find the original session referer.

    with_session_referer AS (
        SELECT *,
               MAX(first_referer) OVER (PARTITION BY global_session_id) AS session_referer
          FROM with_session_ids
    )
    

    This final window function looks for the MAX value of the first_referer for that global_session_id. Since that column was made to be null for every value other than the first event of that session, this will return the first_referer of that session for every event in that session.

    Step 6: Count some stuff.

    SELECT session_referer,
           COUNT(1) AS total_events,
           COUNT(DISTINCT global_session_id) AS distinct_sessions,
           COUNT(DISTINCT dummy_ip) AS distinct_ips
      FROM with_session_referer
     WHERE event_name = 'send_message'
     GROUP BY 1
    

    This last step is straightforward - filter your events to only the event you care about (Submit, in your example). Then count the number of events by session_referer, which is the first referer of the session in which that event occurred. By counting global_session_id and dummy_ip, you can also find how sessions had that event, and how many distinct IPs logged that event.