We are exploring the use of BigQuery policy tags for implementing column level security. We have discovered that when a column that a user does not have access to is included in a GROUP BY clause then the user cannot access other columns in the GROUP BY clause either.
Let me explain with a contrived example. Imagine we have a table t
with two columns:
t.a
t.b
These columns have policy tags on them which grant our end users permission to view the data in t.a
but not the data in t.b
.
Hence our end users can issue this query:
select a
from t
but when issuing this query:
select b
from t
they get an error:
Access Denied: BigQuery BigQuery: User does not have permission to access policy tag "policy-tag-name" on column project.dataset.t.b.
This is expected and is by design.
However if a user issues this query:
select a
from (
select a, b
from t
group by a, b
)
then the same error occurs.
Why is this? The user is not accessing any data that they are not allowed to access. I am struggling to comprehend why the group by statement causes this error. Can anyone enlighten me?
This query does not expose values b, but it would expose some information about b if it were allowed.
E.g. it lets one know how many different values of b each value of a has. Say a is a product id, and b is seller id. Query would show how many separate sellers each product has. But whoever decided to hide sellers probably does not want this. Or a is some employee group (e.g. department or manager id), and b is salary. This query would expose whether they have the same salary. Policy prevents this.
Note that there could be more complex queries where referencing b does not expose anything about b, but policy prevents it anyway. It is also by design - BigQuery plays it safe, and if it cannot prove such reference does not expose any information, it prohibits it.