Search code examples
google-bigqueryduplicatesrow

BigQuery : how to build a new row based on duplicates


In BigQuery, I got this type of data below :

#standardSQL
WITH name_table AS (
  SELECT 'a' id, 1 hitnumber, 'alpha' page UNION ALL
  SELECT 'a', 2, 'beta' UNION ALL
  SELECT 'a', 3, 'beta' UNION ALL
  SELECT 'a', 4, 'alpha' UNION ALL 
  SELECT 'a', 5, 'beta' UNION ALL
  SELECT 'b', 1, 'gamma' UNION ALL
  SELECT 'b', 2, 'gamma'
 )
SELECT *
FROM name_table

 Output :
    id| hitnumber|  page 
    a |   1      |  alpha
    a |   2      |  beta 
    a |   3      |  beta 
    a |   4      |  alpha
    a |   5      |  beta
    b |   1      |  gamma
    b |   2      |  gamma

And I want to build a new hitnumber based on duplicate pages as below.

id| hitnumber|  page |  new_hitnumber
a |   1      |  alpha|  1
a |   2      |  beta |  2
a |   3      |  beta |  2
a |   4      |  alpha|  3
a |   5      |  beta |  4
b |   1      |  gamma|  1
b |   2      |  gamma|  1

Or if it's possible to deduplicate directly and obtain :

id| page |  new_hitnumber
a | alpha|  1
a | beta |  2
a | alpha|  3
a | beta |  4
b | gamma|  1

I tried with ROW_NUMBER() or RANK() but without success.

Many thanks in advance for your help.

Arnaud


Solution

  • want to build a new hitnumber based on duplicate pages as below.

    Use below approach

    select * except(new_group),
      countif(ifnull(new_group, true)) over(partition by id order by hitnumber) new_hitnumber
    from (
      select *, 
        page != lag(page) over(partition by id order by hitnumber) new_group
      from name_table
    )
    # order by id, hitnumber           
    

    if applied to sample data in your question - output is

    enter image description here

    Or if it's possible to deduplicate directly

    select * except(hitnumber) from (
      select * except(new_group),
        countif(ifnull(new_group, true)) over(partition by id order by hitnumber) new_hitnumber
      from (
        select *, 
          page != lag(page) over(partition by id order by hitnumber) new_group
        from name_table
      )
    )
    where true 
    qualify row_number() over(partition by id, page, new_hitnumber order by hitnumber) = 1
    order by id, new_hitnumber         
    

    in this case - output is

    enter image description here