Search code examples
sqldatabasegoogle-bigqueryaggregationconsolidation

Google-Bigquery: consolidate aggregate


I'm trying to create a query which does something a little complex, and I haven't been able to find anything that might point me in the right direction. Perhaps YOU can help!

This is the source data:

7457, "05:06:26 UTC", 15
7457, "05:06:26 UTC", 15
7457, "05:06:26 UTC", 15
7457, "05:06:26 UTC", 15 
2341, "05:12:34 UTC", 10
2341, "05:12:34 UTC", 10
2341, "05:12:34 UTC", 10
2341, "05:12:34 UTC", 10
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:34 UTC", 15
5678, "05:12:39 UTC", 15
5678, "05:12:39 UTC", 15
1111, "06:00:00 UTC", 10
2222, "07:00:00 UTC", 15
3333, "08:00:00 UTC", 10

I have a query to find duplicated stats:

SELECT userID, timestamp, statType, COUNT(*) - 1 AS DuplCount
FROM [dataset1.table1] 
GROUP BY userID, timestamp, statType 
HAVING DuplCount > 0;

(Note that only stats that have the same userID and timestamp can be considered duplicates.)

This results in a table that looks like

userID  timestamp       statType    DuplCount    
7457    05:06:26 UTC    15          3    
2341    05:12:34 UTC    10          3    
5678    05:12:34 UTC    15          7     
5678    05:12:39 UTC    15          1   

I want to further consolidate this data so it can inserted as one row in another table: the sum of the counts of duplicates of the same statType. I want it to look something like

table            stat10DuplCount  stat15DuplCount    
dataset1.table1  3                11 

I'm not sure how to proceed... can this all be done in one query (preferred), or do I need to do multiple queries or do some post-query data processing?


Solution

  • I have figured out how to do exactly what I want; the only difference between this query and Felipe's is that it takes the sum of the duplicates instead of counting each set of duplicates as one occurrence.

    SELECT "dataset1.table1" table, SUM(IF(statID=10,DuplCount,null)) stat10DuplCount, SUM(IF(statID=15,DuplCount,null)) stat15DuplCount, 
    FROM (
        SELECT userID, timestamp, statType, COUNT(*) - 1 AS DuplCount
        FROM [dataset1.table1] AS statsTable
        GROUP BY userID, timestamp, statType 
        HAVING DuplCount > 0
    );
    

    Which results in:

    table            stat10DuplCount  stat15DuplCount    
    dataset1.table1  3                11