Search code examples
mysqlsqlfirebasegoogle-analyticsgoogle-bigquery

How to find median using Google BigQuery based on both the user id and the number of levels the ID has visited?


I have about 100,000 user IDs who are visiting n number of levels. I need to find the overall median of the users and levels visited by each.

I've tried to use AVG based on the number of IDs on each level and the total IDs who started the application. The values are varying alot.

To find IDs who started the application.

SELECT 
event_names, COUNT(DISTINCT id) uniques, COUNT(id) AS total 
FROM xyz.analytics_111.xyz 
WHERE (date BETWEEN "20191018" AND "20191024") AND version = "3.1" AND event_names in ("app_open","internet") AND platform = "ANDROID" 
AND id IN ( SELECT DISTINCT id FROM abc.analytics_111.abc WHERE event_names = "internet" AND internet_status = 1 ) 
GROUP BY event_names

To find total users on each level.

SELECT event_names, story_name, level, COUNT(DISTINCT id) uniques, COUNT(id) AS total 
FROM xyz.analytics_111.xyz WHERE (date BETWEEN "20191018" AND "20191024") AND version = "3.1" AND event_names in ("start_level","end_level") AND platform = "ANDROID" AND id IN ( SELECT DISTINCT id FROM abc.analytics_111.abc 
WHERE event_names = "internet" AND internet_status = 1 ) 
GROUP BY event_names, story_name, level ORDER BY event_names DESC, story_name, level

After this I'm dividing the Sum of User ID count on all levels by the number of User ID who started the application to get the AVG of Levels visited by each user. Is there a way to find a median?


Solution

  • The question doesn't have enough details for a complete answer, but with the elements you've given us:

    • Don't use AVG, when you want MEDIAN

    To calculate a median you can do something like:

    SELECT level_id, fhoffa.x.median(ARRAY_AGG(some_number))
    FROM `table`
    GROUP BY level_id
    

    bqutil.fn.median() is a public UDF we shared with the world:


    Now, with the extra details you provided, if you want

    to get the AVG of Levels visited by each user

    , then:

    SELECT AVG(levels) avg_levels_for_users
    FROM (
      SELECT id user, COUNT(DISTINCT level) levels
      FROM `....`
      GROUP BY user
    )