Search code examples
sqlgoogle-bigquerygoogle-bigquery-java

SQL query to join tables with duplicate rows and getting unique rows


I got the following three tables in BigQuery along with the data below.

I'd like to write a SQL query using id and current date columns and get the following row as the output.

Expected result set:

id            existing_stores                             missing_stores
1003812607    "3640,0130,0131,2306,3638,0127,2789,2305"   "3102,2681,2686,2670,2682,3101,2673,2669,3103,2668"

These are the tables:

item table:

id            date
------------------------
1003812607    2023-08-03
1003812607    2023-08-01
1003812607    2023-07-23
1003812607    2023-06-30

item_change_history:

createdTime                     docType    id       existing_stores
---------------------------------------------------------------------------------------------
2023-08-03 11:01:10.139617 UTC  Item    1003812607  "3640,0130,0131,2306,3638,0127,2789,2305"
2023-07-01 09:01:10.139617 UTC  Item    1003812607  "3640,0130,0131,2306,3638,0127,2789,2301"

mic_item_discrepancy:

ID          MISSING_STORE
-------------------------
1003812607  3102
1003812607  2681
1003812607  2686
1003812607  2670
1003812607  2682
1003812607  3101
1003812607  2673
1003812607  2669
1003812607  3103
1003812607  2668

I tried to come up with this query and it is not working as expected or giving me the wrong data given that duplicate id rows in mic_item_discrepancy table.

SELECT 
    item.id, ich.id, ich.existing_stores, mid.missing_stores
FROM 
    `hd-merch-prod.merch_item_cache_validation.item` item 
JOIN
    `hd-merch-prod.merch_item_cache.item_change_history` ich 
         ON item.date = "2023-08-03" 
         AND DATE(ich.createdTime) = item.date 
         AND ich.id = item.id  
JOIN
    `hd-merch-prod.merch_item_cache.mic_item_discrepancy` mid 
         ON item.id = mid.id
GROUP BY
    item.id, ich.id, ich.existing_stores, mid.missing_stores;

Solution

  • Try to use LEFT JOIN for both the item_change_history and mic_item_discrepancy tables to ensure that all rows from the item table are included in the result, and add a DISTINCT in the STRING_AGG function:

    SELECT 
        item.id, ich.id, ich.existing_stores, COALESCE(mid.missing_stores, '') AS missing_stores
    FROM 
        `hd-merch-prod.merch_item_cache_validation.item` item 
    LEFT JOIN
        (
          SELECT id, existing_stores, MAX(createdTime) as latest_createdTime
          FROM 
            `hd-merch-prod.merch_item_cache.item_change_history`
          WHERE 
            DATE(createdTime) = '2023-08-03'
          GROUP BY 
            id, existing_stores
        ) ich ON item.id = ich.id  
    LEFT JOIN
        (
          SELECT ID, STRING_AGG(DISTINCT MISSING_STORE, ',') AS missing_stores 
          FROM 
            `hd-merch-prod.merch_item_cache.mic_item_discrepancy`
          GROUP BY 
            ID
        ) mid ON item.id = mid.ID;