Search code examples
google-bigquerypartitionrow-number

BigQuery: Create View using ROW_NUMBER function breaks partition filter policy


we have a table created in BQ, 'TS' column used as partitioning column when create the table, like "PARTITION BY DATE(TS)". and we set "require_partition_filter=true"

When we create view like below, query on view works:

CREATE OR REPLACE VIEW mydataset.test_view AS select * from 
mydataset.test_table
--query based on view
select * from mydataset.test_view where TS > TIMESTAMP("2021-09-05 08:30:00")

However, if we add ROW_NUMBER() function in view creation statement, the same query on view would raise error:

CREATE OR REPLACE VIEW mydataset.test_view AS 
select *, ROW_NUMBER() over (
partition by ID
order by ID, TS
) as row_number from mydataset.test_table
--query based on view return error
select * from mydataset.test_view where TS > TIMESTAMP("2021-09-05 08:30:00")
--error msg: Cannot query over table without a filter over column(s) 'TS' 
that can be used for partition elimination

What's the reason for it and what's available solution? Appreciate any ideas/thoughts. Thanks.

Update: We also tried adding 'where' in the view creation. however, for our case, we can not limit the real data time range exposed through view, user want to have capability to query all the data from view. That means we can only filter TS with an always true condition like 'TS is not null' or 'TS > TIMESTAMP("1970-01-01 00:00:00").' By doing this, query on view will not throw error, but it has very poor performance as actually no partition pruning when execute the query even we add additional TS filter when query on view.


Solution

  • As per this doc when you are enabling the require_partition_filter=true, and when you attempt to query the table without a WHERE clause it throws the error you are getting.

    From the first view you have created and in the query, you have passed the where clause and as a result you are getting the output.

    In the second view, you have added the ROW_NUMBER() function and when querying that view you are getting the error message. The error is happening because of the filter conditions you have passed in the query.

    I tried replicating your issue with a sample set of data. The workaround is to create a view with the Where condition(with a subset of the data in the source table) and then query the view as required. You can refer to the below query and let me know if this workaround helps you:

    Creating a view :

    CREATE OR REPLACE VIEW dataset2.view3 AS
    select *, ROW_NUMBER() over (
    partition by id
    order by id, ts
    ) as rownum from `myproject.dataset2.part4` where ts > TIMESTAMP("2021-09-05 08:30:00")
    
    

    Here in the view along with ROW_NUMBER(), I have added the Where clause in the query.

    Query :

    Select * from `myproject.dataset2.view3`
    

    Output :

    enter image description here

    If you add another where clause when querying over the view, it will work and it will not throw any error.

    Query :

    SELECT * FROM `myproject.dataset2.view3` where ts > TIMESTAMP("2021-10-30 10:20:02")
    

    Output :

    enter image description here

    As per your reply since you cannot limit the real data time range exposed through view you can refer to the below cases :

    Case 1 (partition by id):

    CREATE OR REPLACE VIEW dataset2.view8 AS
    select *, ROW_NUMBER() over (
    partition by id
    order by id,ts
    ) as rownum from `myproject.dataset2.part4` where ts IS NOT NULL
    
    

    Query 1 (on the view) :

    SELECT * FROM `myprojectproject.dataset2.view8`
    

    It processed 160B of data.

    Query 2 (on the view) :

    SELECT * FROM `myproject.dataset2.view 8` where ts > TIMESTAMP("2021-09-05 08:30:00")
    

    It also processed 160B of data which means if you are doing partition by id, partition pruning is not happening.

    Case 2 (Partition by ts):

    CREATE OR REPLACE VIEW dataset2.view8 AS
    select *, ROW_NUMBER() over (
    partition by ts
    order by id,ts
    ) as rownum from `myproject.dataset2.part4` where ts IS NOT NULL
    

    Query1 (on the view) :

    SELECT * FROM `myproject.dataset2.view 8`
    

    It processed 160B of data.

    Query2 (on the view ) :

    SELECT * FROM `myproject.dataset2.view 8` where ts > TIMESTAMP("2021-09-05 08:30:00")
    

    It returned 128B of data which means partitioning by ts does partitioning pruning.

    Since your requirement is to attain partition pruning, I tried partition by ts because, after the view is created you are going to query the view not the table. So to attain partition pruning , try to use the same partition condition when creating the view i.e partition by ts that you have used while creating the table and you can use the WHERE clause while querying the view.

    If this does not fulfill your requirement, provide your sample data with output and explain the use case of why you want to partition by id in the view.