Search code examples
sqlgoogle-bigquerylooker-studioretention

New and Retained duplicates in Bigquery


I'm using BigQuery and DataStudio to show retention plotted throughout the weeks and i'm having trouble with that when a user is new and then uses the app again the same week he's both new and retained in my calculations and I would like it to be so that he's only new the first week of using the app and then if he uses the app again within 2 weeks, he's "Retained".

This is my query:

SELECT
UserID,
DATE,
DATE_DIFF(DATE,PreviousSessionDATE, DAY) as DaysBetweenSessions,
(SELECT
 CASE
WHEN DaysBetweenSessions <= 14 THEN 'Retained'
WHEN DaysBetweenSessions >14 THEN 'Returned'
WHEN DaysBetweenSessions IS NULL AND FirstSessionDATE = DATE THEN 'New'
WHEN DaysBetweenSessions IS NULL THEN 'User has an old version without Retention Parameters'
END) as User_Type
FROM
app_project.analytics_*********.events_*
GROUP BY
1,2,3,4
ORDER BY
DATE DESC,
DaysBetweenSessions DESC,
1,2,3,4

Then the result will generally be good except for when a user uses the app several times and get a between 1-14 as DaysBetweenSessions and then be counted as both new and retained within the same week.

Then in DataStudio I will do YEARWEEK(DATE) to vizualise per week and Count_Distinct(UserID) as my metrics.

Any ideas on how i can sort out so that the new user is only counted as new in the first week even if the user trains within that week?

Current output in BQ:
UserID     DATE        DaysBetweenSessions     User_Type
123        20180801    NULL                    "New"
123        20180801    0                       "Retained"

And desired output

UserID     DATE        DaysBetweenSessions     User_Type
123        20180801    NULL                    "New"

Solution

  • There might be a neater way to do this but...

    WITH CTE AS
    (SELECT
    UserID,
    DATE,
    DATE_DIFF(DATE,PreviousSessionDATE, DAY) as DaysBetweenSessions,
    (SELECT
    CASE
    WHEN DaysBetweenSessions <= 14 THEN 'Retained'
    WHEN DaysBetweenSessions >14 THEN 'Returned'
    WHEN DaysBetweenSessions IS NULL AND FirstSessionDATE = DATE THEN 'New'
    WHEN DaysBetweenSessions IS NULL THEN 'User has an old version without Retention 
    Parameters'
    END) as User_Type,
    (SELECT
    CASE
    WHEN DaysBetweenSessions <= 7 THEN 0
    WHEN DaysBetweenSessions >7 THEN 1
    WHEN DaysBetweenSessions IS NULL AND FirstSessionDATE = DATE THEN 0
    WHEN DaysBetweenSessions IS NULL THEN 2
    END) as DaysBetween
    FROM
    app_project.analytics_*********.events_*
    GROUP BY
    1,2,3,4
    ORDER BY
    DATE DESC,
    DaysBetweenSessions DESC,
    1,2,3,4),
    
    Result as 
    (SELECT *, min(User_Type) OVER (PARTITION BY UserID, DaysBetween) minUser_Type
    FROM CTE)
    
    SELECT UserID,
    DATE,
    DaysBetweenSessions,
    User_type 
    FROM Result 
    WHERE NOT (User_Type <> 'New' AND minUser_Type = 'New')
    

    The second part should add a dimension which is the alphabetically lowest User_Type in that week (so if you renamed anything to alphabetically less than 'New' it wouldn't work, better using numbers ideally).

    The last part should get rid of the ones where there was a 'New' in that week but the row User_Type isn't 'New'.