Search code examples
sqlrandomgroup-bygoogle-bigquerysampling

group by id and random sample by id from two tables in big query sql


I have a 2 tables with the same structure:

table1

id text        var  
1  "bla bla"   100
1  "blabla1"    30
2  "qweweqty"    0
2    etc...
7
3
3
1
..
100

table2

id text        var  
101 "bla bla"   10
101  "bla1"      60
101  "bla"    5
103    etc...
102
103
102
110
..
200

I want to randomly sample data from table1 and table2 based on id. so basically, sample every observation for a random sample of ids from table1 and and every observation from random sample of ids from table2 so that 50 ids are from table 1 and 50 are from table 2 . any idea on how to do this on big query SQL?

en example would be the following where I want to get 3 ids from table1 and 3 from table2

randomly ids 1,2,3 are selected from table1, ids 101, 110 and 103 are selected from table2

the resulting table is then:

id. text var
 1.   ..  ..
 1
 2
 2
 3
 3
 1
 101
 101
 101
 103
 103
 110

so basically any observation from table1 with id 1,2,3 and any observation from table2 with id 101, 103, 110 are selected and put in the same table: so the passages are two: first randomly select a certain number of ids from table1, and a certain number of ids from table2, then I select any observation corresponding to those ids from both tables and I join them in the same table


Solution

  • If you want 50 ids from each table, then you can limit them using subqueries:

    select t1.*
    from t1 join
         (select distinct id
          from t1
          order by rand()
          limit 50
         ) ids
         on t1.id = ids.id
    union all
    select t2.* 
    from t2 join
         (select distinct id
          from t2
          order by rand()
          limit 50
         ) ids
         on t2.id = ids.id