I am trying to assign a control status of true to 10% of distinct groups in my table. Everything runs fine, but when I view the results, the percentages are way off. I'm getting more like 1/3 of records with the true control status, instead of the desired 10%. I've been using sample based on the distinct group id, but please let me know if something looks off here. Thanks!
PS: This is SQL Snowflake
Code and results listed below :
update table
set control_status=true
where group_id in
(select DISTINCT(group_id) from table sample(10));
update table
set control_status=false
where control_status is null;
select control_status, count(distinct(group_id)),count(distinct(person_id)), count(control_status) from table group by control_status;
Results
control_status | count(distinct(group_id)) | count(distinct(person_id)) |
---|---|---|
True | 50,000 | 100,000 |
False | 100,000 | 200,000 |
I think what you're looking for is the following - you need to create a representative set of the distinct group_ids in order to chose 10% of those and not be skewed by the rows in the underlying table that pertain to people who have a 1-M relationship with group_id:
update table_name
set control_status=true
where group_id in
(select group_id from (select distinct (group_id) from table_name) x sample(10));
The rest remains the same as your original post:
update table
set control_status=false
where control_status is null;
select control_status, count(distinct(group_id)),count(distinct(person_id)), count(control_status) from table group by control_status;