Search code examples
influxdb

Missing data using GROUP BY with InfluxDB subqueries


This is some seed data which contains 30 events. For each of two possible values in user_id, there are 5 values for event, and each of those is duplicated among 3 possible times.

> SELECT time,user_id,event,score_value FROM user_scores
name: user_scores
time                user_id event                   score_value
----                ------- -----                   -----------
1517616000000000000 456     card_comment_created    10
1517616000000000000 123     card_comment_created    5
1517616000000000000 123     card_created            5
1517616000000000000 456     card_created            10
1517616000000000000 456     card_liked              10
1517616000000000000 123     card_liked              5
1517616000000000000 123     card_marked_as_complete 5
1517616000000000000 456     card_marked_as_complete 10
1517616000000000000 123     card_viewed             5
1517616000000000000 456     card_viewed             10
1517702400000000000 456     card_comment_created    10
1517702400000000000 123     card_comment_created    5
1517702400000000000 123     card_created            5
1517702400000000000 456     card_created            10
1517702400000000000 456     card_liked              10
1517702400000000000 123     card_liked              5
1517702400000000000 456     card_marked_as_complete 10
1517702400000000000 123     card_marked_as_complete 5
1517702400000000000 123     card_viewed             5
1517702400000000000 456     card_viewed             10
1517788800000000000 456     card_comment_created    10
1517788800000000000 123     card_comment_created    5
1517788800000000000 123     card_created            5
1517788800000000000 456     card_created            10
1517788800000000000 456     card_liked              10
1517788800000000000 123     card_liked              5
1517788800000000000 456     card_marked_as_complete 10
1517788800000000000 123     card_marked_as_complete 5
1517788800000000000 123     card_viewed             5
1517788800000000000 456     card_viewed             10
>

I am downsampling the data into a daily aggregation using the following query:

    SELECT \
      user_id,total_user_score,smartbites_commented_count,\
      smartbites_completed_count,smartbites_consumed_count,\
      smartbites_liked_count \
   INTO user_scores_daily \
   FROM ( \
     SELECT SUM(score_value) AS total_user_score \
     FROM user_scores \
     GROUP BY time(1d),user_id \
   ),( \
     SELECT COUNT(score_value) AS smartbites_commented_count \
     FROM user_scores \
     WHERE event='card_comment_created' \
     GROUP BY time(1d),user_id \
   ),( \
     SELECT COUNT(score_value) AS smartbites_completed_count \
     FROM user_scores \
     WHERE event='card_marked_as_complete' \
     GROUP BY time(1d),user_id \
   ),( \
     SELECT COUNT(score_value) AS smartbites_consumed_count \
     INTO smartbites_consumed_counts_daily \
     FROM user_scores \
     WHERE event='card_viewed' \
     GROUP BY time(1d),user_id \
   ),( \
     SELECT COUNT(score_value) AS smartbites_liked_count \
     FROM user_scores \
     WHERE event='card_liked' \
     GROUP BY time(1d),user_id \
   )

Notice how each of the subqueries is grouping by time(1d) and user_id. I need a row in the results for each user/day combination.

This is my results:

> SELECT * FROM user_scores_daily
name: user_scores_daily
time                smartbites_commented_count smartbites_completed_count smartbites_consumed_count smartbites_liked_count total_user_score user_id
----                -------------------------- -------------------------- ------------------------- ---------------------- ---------------- -------
1517616000000000000 1                          1                          1                         1                      50               456
1517702400000000000 1                          1                          1                         1                      50               456
1517788800000000000 1                          1                          1                         1                      50               456

The data for one of the users looks perfect. But what about the second user? There should be six total rows, but there are only 3. It's missing three rows where user_id=123.

edit in response to comment:

> SHOW TAG KEYS FROM "user_scores"
name: user_scores
tagKey
------
actor_id
analytics_version
event
owner_id
role
user_id
> SHOW FIELD KEYS FROM "user_scores"
name: user_scores
fieldKey    fieldType
--------    ---------
score_value integer
>

Solution

  • What I ended up doing is adding GROUP BY user_id,time(1d) to my top level query (after the subqueries), and changing the fields selected by my outermost SELECT into aggregations.

    These aggregations are redundant, but if I am going to use GROUP BY on the top level query I need to use them.

    Code looks like so:

       SELECT \
         MEAN(user_id) as user_id,\
         MEAN(total_user_score) as total_user_score,\
         MEAN(smartbites_commented_count) as smartbites_commented_count,\
         MEAN(smartbites_consumed_count) as smartbites_consumed_count,\
         MEAN(smartbites_liked_count) as smartbites_liked_count,\
       INTO user_scores_daily \
       FROM ( \
         SELECT SUM(score_value) AS total_user_score \
         FROM user_scores \
         GROUP BY time(1d),user_id \
       ),( \
         SELECT COUNT(score_value) AS smartbites_commented_count \
         FROM user_scores \
         WHERE event='card_comment_created' \
         GROUP BY time(1d),user_id \
       ),( \
         SELECT COUNT(score_value) AS smartbites_completed_count \
         FROM user_scores \
         WHERE event='card_marked_as_complete' \
         GROUP BY time(1d),user_id \
       ),( \
         SELECT COUNT(score_value) AS smartbites_consumed_count \
         INTO smartbites_consumed_counts_daily \
         FROM user_scores \
         WHERE event='card_viewed' \
         GROUP BY time(1d),user_id \
       ),( \
         SELECT COUNT(score_value) AS smartbites_liked_count \
         FROM user_scores \
         WHERE event='card_liked' \
         GROUP BY time(1d),user_id \
       ) \
       GROUP BY time(1d),user_id
    

    That's one MEAN query if I may say so myself (I'll show myself out)