
How to get a conditional calculation that I want with SQL


I have a table with more than 500k rows and the columns shown below:

  user_id | event_date | event_name    | version
  ==============================================
  1543435 | 18092021   | first_open    | 113
  5476523 | 18092021   | session_start | 111
  7418529 | 18092021   | first_open    | 112
  1754821 | 18092021   | first_open    | 113
  9820011 | 18092021   | session_start | 114
  4568714 | 18092021   | session_start | 120

An event_name of first_open means that the user installed and opened the application for the first time, while session_start means the user had already installed and opened it before, so it is not the first time.

user_id is unique for each user and does not change between logins.

We are only dealing with users on version 113.

I need to find the proportion of users who installed the application on the 18th (18.09.2021) and opened it again on day 1 (19.09.2021) and on day 3 (21.09.2021).

Roadmap:

  • Step 1: Finding unique user_ids with event date=18092021 and event_name=first_open and obtaining their count
  • Step 2: Finding the count of matches with event date=19092021 and event_name=session_start, which also match the user_ids from step 1
  • Step 3: Finding the count of matches with event date=21092021 and event_name=session_start, which also match the user_ids from step 1
  • Step 4: Dividing the count from step 2 by the count from step 1 gives Day 1 retention
  • Step 5: Dividing the count from step 3 by the count from step 1 gives Day 3 retention

After a week of researching and brainstorming, I have written the query below:

SELECT 
(SELECT COUNT(DISTINCT our_data.user_id)
FROM our_data WHERE our_data.event_date = '20210918' 
AND our_data.event_name ='first_open' 
AND our_data.version = '113') 
AS DAY_ZERO,

(SELECT COUNT(DISTINCT dayone.user_id) 
FROM our_data
LEFT JOIN our_data AS dayone 
ON our_data.user_id = dayone.user_id
WHERE our_data.event_date = '20210918' 
AND dayone.event_date = our_data.event_date +1
AND our_data.event_name ='first_open' 
AND dayone.event_name ='session_start' 
AND our_data.version = '113' 
AND dayone.version = '113')
AS DAY_ONE,

(SELECT COUNT(DISTINCT our_data.user_id)
FROM our_data
LEFT JOIN our_data as daythree
ON our_data.user_id = daythree.user_id 
WHERE our_data.event_date = '20210918' 
AND daythree.event_date = our_data.event_date +3
AND our_data.event_name ='first_open' 
AND daythree.event_name ='session_start' 
AND our_data.version = '113' 
AND daythree.version = '113')
AS DAY_THREE

This query gave me this result:

  DAY_ZERO | DAY_ONE |  DAY_THREE |
========================================
  14879    |   7850  |     949    |   

I could not perform any further calculation on these results within the same query. I need DAY_ONE/DAY_ZERO = Day 1 retention and DAY_THREE/DAY_ZERO = Day 3 retention. I also need to repeat these calculations for other install dates in the same table, so I have to do it in a single query. How can I do this?


Solution

  • I would have posted this in a comment, but the query seems a bit too long for a comment. In short, the best answer would depend a bit on the SQL flavour you actually use (T-SQL? PL/SQL? PL/pgSQL?), but the generic approach would be the same.

    You have already calculated the required values in the subqueries. I have reused those subqueries, but instead of using them as scalar values in the SELECT list, I turned them into derived tables, each with a uniquely named column.

    Now you have three tables, each having one row. Cross join those to get a single row with three different columns, and do the desired calculations on those columns.

    select 
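      -- 1.0 * forces decimal division; some dialects (e.g. SQL Server, PostgreSQL) truncate integer / integer
      1.0 * DAY_ONE.cnt1 / DAY_ZERO.cnt0   as DAY_1_RETENTION,
      1.0 * DAY_THREE.cnt3 / DAY_ZERO.cnt0 as DAY_3_RETENTION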
    FROM (SELECT COUNT(DISTINCT our_data.user_id) AS cnt0
          FROM our_data WHERE our_data.event_date = '20210918' 
          AND our_data.event_name ='first_open' 
          AND our_data.version = '113') 
         AS DAY_ZERO,
    
         (SELECT COUNT(DISTINCT dayone.user_id) cnt1
          FROM our_data
          LEFT JOIN our_data AS dayone 
          ON our_data.user_id = dayone.user_id
          WHERE our_data.event_date = '20210918' 
          AND dayone.event_date = our_data.event_date +1
          AND our_data.event_name ='first_open' 
          AND dayone.event_name ='session_start' 
          AND our_data.version = '113' 
          AND dayone.version = '113')
         AS DAY_ONE,
    
         (SELECT COUNT(DISTINCT our_data.user_id) cnt3
          FROM our_data
          LEFT JOIN our_data as daythree
          ON our_data.user_id = daythree.user_id 
          WHERE our_data.event_date = '20210918' 
          AND daythree.event_date = our_data.event_date +3
          AND our_data.event_name ='first_open' 
          AND daythree.event_name ='session_start' 
          AND our_data.version = '113' 
          AND daythree.version = '113')
          AS DAY_THREE
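
    Since you mention needing the same numbers for other install dates, one possible variation is to group by the install date instead of hard-coding it. This is only a sketch under assumptions that are not stated in the question: that event_date is (or has been cast to) a real DATE column, and that your dialect supports "date + integer" arithmetic the way PostgreSQL does. The names cohort, install_date, c and s are made up for illustration.

    ```sql
    -- Sketch only. Assumes event_date is a DATE and "date + 1" means "next day"
    -- (PostgreSQL-style); adapt the date arithmetic to your SQL flavour.
    WITH cohort AS (
        -- One row per user and install date (hypothetical helper CTE)
        SELECT DISTINCT user_id, event_date AS install_date
        FROM our_data
        WHERE event_name = 'first_open'
          AND version = '113'
    )
    SELECT
        c.install_date,
        COUNT(DISTINCT c.user_id) AS day_zero,
        -- Users of this cohort with a session_start exactly 1 day after install
        1.0 * COUNT(DISTINCT CASE WHEN s.event_date = c.install_date + 1
                                  THEN s.user_id END)
            / COUNT(DISTINCT c.user_id) AS day_1_retention,
        -- Users of this cohort with a session_start exactly 3 days after install
        1.0 * COUNT(DISTINCT CASE WHEN s.event_date = c.install_date + 3
                                  THEN s.user_id END)
            / COUNT(DISTINCT c.user_id) AS day_3_retention
    FROM cohort AS c
    LEFT JOIN our_data AS s
           ON s.user_id = c.user_id
          AND s.event_name = 'session_start'
          AND s.version = '113'
    GROUP BY c.install_date
    ORDER BY c.install_date;
    ```

    The LEFT JOIN keeps users who never came back, so they still count in day_zero, and the CASE expressions return NULL for non-matching rows, which COUNT(DISTINCT ...) ignores. If event_date is stored as text such as '20210918', adding 1 or 3 to it breaks at month boundaries, so it would need converting to a date first.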