
How to remove duplicates in a Google BigQuery query by a subset of returned columns, and keep the first?


In pandas, I can drop duplicates by a subset of columns and keep the first occurrence with

df = df.drop_duplicates(subset=['column1', 'column6'], keep='first')
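
To make the pandas behaviour concrete, here is a minimal, self-contained sketch on toy data (the column names and values are made up for illustration):

```python
import pandas as pd

# Toy frame: rows 0 and 2 share the same (column1, column6) pair.
df = pd.DataFrame({
    "column1": ["a", "a", "a", "b"],
    "column6": [1, 2, 1, 1],
    "column2": [10, 20, 30, 40],
})

# keep='first' retains the first row of each duplicate group,
# in the frame's existing row order; row 2 (column2 == 30) is dropped.
deduped = df.drop_duplicates(subset=["column1", "column6"], keep="first")
print(deduped)
```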

I am trying to figure out how to do the same in a query for Google BigQuery.

I saw that GROUP BY is recommended for such a case, so I tried something like

query = """
SELECT
  table1.column1,
  table1.column2,
  table1.column3,
  table2.column4,
  table2.column5,
  table3.column6
FROM table1
JOIN table2
  ON (
    table1.columnA = table2.columnA
    AND table1.columnB = table2.columnB
    AND table1.columnC = table2.columnC
  )
JOIN table3
  ON table3.column6 = table2.column6
GROUP BY
  table1.column1,
  table3.column6
"""

I get an error like

Select list expression references tableX.columnY which is neither grouped nor aggregated at [4:5]

It seems that the query fails because my SELECT list contains columns that are neither in the GROUP BY nor aggregated. After some more googling, I saw that DISTINCT ON should keep the first of the multiple values for each group. So then I tried

query = """
SELECT DISTINCT ON (table1.column1, table3.column6)
  table1.column1,
  table1.column2,
  table1.column3,
  table2.column4,
  table2.column5,
  table3.column6
FROM table1
JOIN table2
  ON (
    table1.columnA = table2.columnA
    AND table1.columnB = table2.columnB
    AND table1.columnC = table2.columnC
  )
JOIN table3
  ON table3.column6 = table2.column6
"""

But I am now getting an error that says something like

400 syntax error, unexpected keyword ON at

From searching, it seems that Google BigQuery does not support DISTINCT ON. But the only alternative I can find is GROUP BY, which, as shown above, does not work for my case.

So, within the query itself, how can I drop duplicates on a subset of columns, keeping only the first row when grouping by that subset yields multiple rows?

A solution that de-duplicates after the query returns won't work for me: the query without deduplication takes a long time to finish precisely because of the large number of rows returned.


Solution

  • As @Jaytiger has mentioned in the comments, we have to use the ROW_NUMBER() function coupled with PARTITION BY and ORDER BY clauses.

    Consider the query below. I have tested it on sample data and compared the results with those of the pandas snippet.

    SELECT *
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY column1, column6 ORDER BY columnX) AS row_num
      FROM
        `<project-id>.test_dataset.keep_first_in_duplicate`
    )
    WHERE row_num = 1
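
If you want to sanity-check the query against pandas, the same keep-first logic can be emulated locally. This is a sketch on made-up data, where columnX plays the role of the ORDER BY tie-breaker:

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "column1": ["a", "a", "b", "b"],
    "column6": [1, 1, 2, 2],
    "columnX": [5, 3, 7, 9],
    "value":   ["p", "q", "r", "s"],
})

# Emulate ROW_NUMBER() OVER (PARTITION BY column1, column6 ORDER BY columnX):
# number the rows of each (column1, column6) group in columnX order.
df["row_num"] = (
    df.sort_values("columnX")
      .groupby(["column1", "column6"])
      .cumcount() + 1
)

# Keeping row_num == 1 mirrors the outer WHERE row_num = 1 filter:
# the row with the smallest columnX survives in each group.
deduped = df[df["row_num"] == 1].drop(columns="row_num")
print(deduped)  # rows with value "q" and "r" survive
```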
    

    Whether you need the ORDER BY clause in the window depends on whether the input order must be preserved. Unlike a pandas DataFrame, BigQuery does not preserve the order of the input data. To preserve it, add a column of row indices before ingesting the data into BigQuery and use that column as the sort key. In short, if your data source follows a certain order and you do not carry that order into BigQuery, the deduplication output from BigQuery can differ from that of the pandas snippet.
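
One hedged way to get that order preservation (row_index is a column name I made up) is to materialise the row position before loading into BigQuery, then use it both as the window's ORDER BY key and as the final sort key. Emulated locally:

```python
import pandas as pd

# Toy data; rows 0 and 2 are duplicates on (column1, column6).
df = pd.DataFrame({
    "column1": ["a", "b", "a"],
    "column6": [1, 2, 1],
})

# Materialise the original position as a column before upload.
df = df.reset_index().rename(columns={"index": "row_index"})

# In BigQuery this would be
#   ROW_NUMBER() OVER (PARTITION BY column1, column6 ORDER BY row_index)
# followed by a final ORDER BY row_index; emulated here:
df["row_num"] = (
    df.sort_values("row_index")
      .groupby(["column1", "column6"])
      .cumcount() + 1
)
deduped = df[df["row_num"] == 1].sort_values("row_index")

# This matches pandas' keep='first' on the same subset.
expected = df.drop_duplicates(subset=["column1", "column6"], keep="first")
print(deduped)
```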