Search code examples
sqlsnowflake-cloud-data-platformsample

SQL Sample Function Seems to be returning more than the specific percentage


I am trying to assign a control status of true to 10% of distinct groups in my table. Everything runs fine, but when I view the results, the percentages are way off. I'm getting more like 1/3 of records with the true control status, instead of the desired 10%. I've been using sample based on the distinct group id, but please let me know if something looks off here. Thanks!

PS: This is SQL Snowflake

Code and results listed below :

update table 
    set control_status=true
    where group_id in
       (select DISTINCT(group_id) from table sample(10));

update table 
    set control_status=false
    where control_status is null;


select control_status, count(distinct(group_id)),count(distinct(person_id)), count(control_status) from table group by control_status;

Results

control_status count(distinct(group_id)) count(distinct(person_id))
True 50,000 100,000
False 100,000 200,000

Solution

  • I think what you're looking for is the following - you need to create a representative set of the distinct group_ids in order to chose 10% of those and not be skewed by the rows in the underlying table that pertain to people who have a 1-M relationship with group_id:

    update table_name 
        set control_status=true
        where group_id in
           (select group_id from (select distinct (group_id) from table_name) x sample(10));
    

    The rest remains the same as your original post:

    update table 
        set control_status=false
        where control_status is null;
    
    
    select control_status, count(distinct(group_id)),count(distinct(person_id)), count(control_status) from table group by control_status;