I have a table with more than 500k rows and columns like the ones below:
user_id | event_date | event_name    | version
==============================================
1543435 | 18092021   | first_open    | 113
5476523 | 18092021   | session_start | 111
7418529 | 18092021   | first_open    | 112
1754821 | 18092021   | first_open    | 113
9820011 | 18092021   | session_start | 114
4568714 | 18092021   | session_start | 120
An event_name of first_open means the user installed the application and opened it for the first time, while session_start means the user had already installed and opened it before.
user_id is unique for each user and does not change between logins.
We are only interested in users on version 113.
I need to find the proportion of users who installed the application on the 18th (18.09.2021) and opened it again on day 1 (19.09.2021) and on day 3 (21.09.2021).
Roadmap:
After a week of researching and brainstorming, I have written the query below:
SELECT
(SELECT COUNT(DISTINCT our_data.user_id)
FROM our_data WHERE our_data.event_date = '20210918'
AND our_data.event_name ='first_open'
AND our_data.version = '113')
AS DAY_ZERO,
(SELECT COUNT(DISTINCT dayone.user_id)
FROM our_data
LEFT JOIN our_data AS dayone
ON our_data.user_id = dayone.user_id
WHERE our_data.event_date = '20210918'
AND dayone.event_date = our_data.event_date +1
AND our_data.event_name ='first_open'
AND dayone.event_name ='session_start'
AND our_data.version = '113'
AND dayone.version = '113')
AS DAY_ONE,
(SELECT COUNT(DISTINCT our_data.user_id)
FROM our_data
LEFT JOIN our_data as daythree
ON our_data.user_id = daythree.user_id
WHERE our_data.event_date = '20210918'
AND daythree.event_date = our_data.event_date +3
AND our_data.event_name ='first_open'
AND daythree.event_name ='session_start'
AND our_data.version = '113'
AND daythree.version = '113')
AS DAY_THREE
This query gave me this result:
DAY_ZERO | DAY_ONE | DAY_THREE |
========================================
14879 | 7850 | 949 |
I could not perform any further calculation on these results within the same query. I need DAY_ONE/DAY_ZERO as the day 1 retention and DAY_THREE/DAY_ZERO as the day 3 retention. I also need to do this for other install dates in the same table, so it has to be done in a single query. How can I do this?
I would have posted this in a comment, but the query seems a bit too long for a comment. In short, the best answer would depend a bit on the SQL flavour you actually use (T-SQL? PL/SQL? PL/pgSQL?), but the generic approach would be the same.
You have already calculated the required values in the subqueries. I have reused those subqueries, but instead of using them as scalar values in the select list, I turned them into derived tables with uniquely named columns.
Now you have three tables, each having one row. Cross join them to get a single row with three columns, then do the desired calculations on those columns:
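SELECT
-- multiplying by 1.0 keeps dialects that use integer division from truncating the ratio to 0
DAY_ONE.cnt1 * 1.0 / DAY_ZERO.cnt0 AS DAY_1_RETENTION,
DAY_THREE.cnt3 * 1.0 / DAY_ZERO.cnt0 AS DAY_3_RETENTION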
FROM (SELECT COUNT(DISTINCT our_data.user_id) AS cnt0
FROM our_data WHERE our_data.event_date = '20210918'
AND our_data.event_name ='first_open'
AND our_data.version = '113')
AS DAY_ZERO,
(SELECT COUNT(DISTINCT dayone.user_id) cnt1
FROM our_data
LEFT JOIN our_data AS dayone
ON our_data.user_id = dayone.user_id
WHERE our_data.event_date = '20210918'
AND dayone.event_date = our_data.event_date +1
AND our_data.event_name ='first_open'
AND dayone.event_name ='session_start'
AND our_data.version = '113'
AND dayone.version = '113')
AS DAY_ONE,
(SELECT COUNT(DISTINCT our_data.user_id) cnt3
FROM our_data
LEFT JOIN our_data as daythree
ON our_data.user_id = daythree.user_id
WHERE our_data.event_date = '20210918'
AND daythree.event_date = our_data.event_date +3
AND our_data.event_name ='first_open'
AND daythree.event_name ='session_start'
AND our_data.version = '113'
AND daythree.version = '113')
AS DAY_THREE
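
Since the question also asks for the same ratios for other install dates, here is a rough sketch of how the cross-join idea could be generalized with conditional aggregation, giving one output row per install date. It is not a drop-in solution: it assumes event_date is stored as ISO 'YYYY-MM-DD' text and uses SQLite's date() function for the date arithmetic (other dialects would need DATEADD, INTERVAL or similar). The names installs, sessions, i and s are made up for illustration; our_data and its columns come from the question.

```sql
-- Sketch only. Assumptions: event_date is ISO 'YYYY-MM-DD' text,
-- and date(x, '+N days') is available (SQLite syntax).
WITH installs AS (
    -- one row per user who installed version 113, with the install date
    SELECT user_id, MIN(event_date) AS install_date
    FROM our_data
    WHERE event_name = 'first_open'
      AND version = '113'
    GROUP BY user_id
),
sessions AS (
    -- distinct (user, date) pairs on which the user came back on version 113
    SELECT DISTINCT user_id, event_date
    FROM our_data
    WHERE event_name = 'session_start'
      AND version = '113'
)
SELECT
    i.install_date,
    COUNT(DISTINCT i.user_id) AS day_zero,
    -- users with a session exactly 1 day after install, divided by all installs;
    -- * 1.0 avoids integer division
    COUNT(DISTINCT CASE WHEN s.event_date = date(i.install_date, '+1 day')
                        THEN i.user_id END) * 1.0
        / COUNT(DISTINCT i.user_id) AS day_1_retention,
    -- same for exactly 3 days after install
    COUNT(DISTINCT CASE WHEN s.event_date = date(i.install_date, '+3 days')
                        THEN i.user_id END) * 1.0
        / COUNT(DISTINCT i.user_id) AS day_3_retention
FROM installs AS i
LEFT JOIN sessions AS s
       ON s.user_id = i.user_id
GROUP BY i.install_date
ORDER BY i.install_date;
```

The LEFT JOIN keeps users who never came back, so they still count in the denominator. To get only the cohort from the question, a WHERE i.install_date = '2021-09-18' before the GROUP BY should be enough, again assuming the ISO date format.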