Search code examples
sqlwindow-functions

Calculating user retention on daily basis between the dates in SQL


I have a table that has the data about user_ids, all their last log_in dates to the app

Table:

|----------|--------------|
| User_Id  | log_in_dates |
|----------|--------------|
|   1      |  2021-09-01  |
|   1      |  2021-09-03  |
|   2      |  2021-09-02  |
|   2      |  2021-09-04  |
|   3      |  2021-09-01  |
|   3      |  2021-09-02  |
|   3      |  2021-09-03  |
|   3      |  2021-09-04  |
|   4      |  2021-09-03  |
|   4      |  2021-09-04  |
|   5      |  2021-09-01  |
|   6      |  2021-09-01  |
|   6      |  2021-09-09  |
|----------|--------------|

From the above table, I'm trying to understand the user's log in behavior from the present day to the past 90 days.

Num_users_no_log_in defines the count for the number of users who haven't logged in to the app from present_day to the previous days (last_log_in_date)

I want the table like below:

|---------------|------------------|--------------------|-------------------------|
| present_date  | days_difference  | last_log_in_date   |  Num_users_no_log_in    |
|---------------|------------------|--------------------|-------------------------|
|  2021-09-01   |       0          |    2021-09-01      |         0               |
|  2021-09-02   |       1          |    2021-09-01      |         3               |->(Id = 1,5,6)
|  2021-09-02   |       0          |    2021-09-02      |         3               |->(Id = 1,5,6)
|  2021-09-03   |       2          |    2021-09-01      |         2               |->(Id = 5,6)  
|  2021-09-03   |       1          |    2021-09-02      |         1               |->(Id = 2)    
|  2021-09-03   |       0          |    2021-09-03      |         3               |->(Id = 2,5,6)
|  2021-09-04   |       3          |    2021-09-01      |         2               |->(Id = 5,6)  
|  2021-09-04   |       2          |    2021-09-02      |         0               |
|  2021-09-04   |       1          |    2021-09-03      |         1               |->(Id= 1)     
|  2021-09-04   |       0          |    2021-09-04      |         3               |->(Id = 1,5,6)
|    ....       |       ....       |        ....        |         ....  
|---------------|------------------|--------------------|-------------------------|

I was able to get the first three columns Present_date | days_difference | last_log_in_date using the following query:

with dts as
(
    select distinct log_in from users_table
)
select x.log_in_dates as present_date,
       DATEDIFF(DAY, y.log_in_dates ,x.log_in_dates ) as Days_since_last_log_in,
       y.log_in_dates as log_in_dates 
       from dts x, dts y
       where x.log_in_dates >= y.log_in_dates

I don't understand how I can get the fourth column Num_users_no_log_in


Solution

  • I do not really understand your need: are there values base on users or dates? It it's based on dates, as it looks like (elsewhere you would probably have user_id as first column), what does it mean to have multiple times the same date? I understand that you would like to have a recap for all dates since the beginning until the current date, but in my opinion in does not really make sens (imagine your dashboard in 1 year!!)

    Once this is said, let's go to the approach.
    In such cases, I develop step by step using common table extensions. For you example, it required 3 steps:

    • prepare the time series
    • integrate connections' dates and perform the first calculation (time difference)
    • Finally, calculate nb connection per day

    Then, the final query will display the desired result.

    Here is the query I proposed, developed with Postgresql (you did not precise your dbms, but converting should not be such a big deal here):

    with init_calendar as (
       -- Prepare date series and count total users
       select generate_series(min(log_in_dates), now(), interval  '1 day') as present_date,
              count(distinct user_id) as nb_users
         from users
    ),
    calendar as (
       -- Add connections' dates for each period from the beginning to current date in calendar
       -- and calculate nb days difference for each of them
       -- Syntax my vary depending dbms used
       select distinct present_date, log_in_dates as last_date,
              extract(day from present_date - log_in_dates) as days_difference,
              nb_users
         from init_calendar
         join users on log_in_dates <= present_date
    ),
    usr_con as (
        -- Identify last user connection's dates according to running date
        -- Tag the line to be counted as no connection
        select c.present_date, c.last_date, c.days_difference, c.nb_users,
               u.user_id, max(log_in_dates) as last_con,
               case when max(log_in_dates) = present_date then 0 else 1 end as to_count
          from calendar c
          join users u on u.log_in_dates <= c.last_date
          group by c.present_date, c.last_date, c.days_difference, c.nb_users, u.user_id
    )
    select present_date, last_date, days_difference,
           nb_users - sum(to_count) as Num_users_no_log_in
    from usr_con
     group by present_date, last_date, days_difference, nb_users
     order by present_date, last_date
    

    Please note that there is a difference with your own expected result as you forgot user_id = 3 in your calculation. If you want to play with the query, you can with dbfiddle