Search code examples
sqlselectamazon-redshiftsql-insertsample

Redshift sample from table based on count of another table


I have TableA of, say, 3000 rows (could be any number < 10000). I need to create TableX with 10000 rows. So I need to select random 10000 - (number of rows in TableA) from TableB (and add in TableA as well) to create TableX. Any ideas please? Something like this (which obviously wouldnt work):

Create table TableX as
select * from TableA
union
select * from TableB limit (10000 - count(*) from TableA);

Solution

  • You could use union all and window functions. You did not list the table columns, so I assumed col1 and col2:

    insert into tableX (col1, col2)
    select col1, col2 from table1
    union all 
    select t2.col1, t2.col2
    from (select t2.*, row_number() over(order by random()) from table2 t2) t2
    inner join (select count(*) cnt from table1) t1 on t2.rn <= 10000 - t1.cnt
    

    The first query in union all selects all rows from table1. The second query assigns random row numbers to rows in table2, and then selects as many rows as needed to reach a total of 10000.

    Actually it might be simpler to select all rows from both tables, then order by and limit in the outer query:

    insert into tableX (col1, col2)
    select col1, col2
    from (
        select col1, col2, 't1' which from table1
        union all 
        select col1, col2, 't2' from table2
    ) t
    order by which, random()
    limit 10000