Search code examples
sql-servert-sqlrandom

Generating Bootstrapped Samples in T-SQL


Does anyone have some ideas on how to generate a bootstrapped sample of rows in T-SQL? The problem with using something like NEWID() or RAND() then just sorting is that you end up generating a random sample without replacement. Bootstrapping should be done by sampling with replacement.


Solution

  • Here's one solution. It starts by assigning a row identifier to each row in the population table.

    Next, a uniform random integer between 1 and the rowcount of the population, inclusive, is generated using a call to RAND() via a view and the RandMod() function, as described in Question 3531450.

    These two queries are JOINed together to create a random sample with replacement.

    To get a bootstrapped sample of size K < N, sort on NEWID() and grab the top K rows.

    SELECT TOP K 
        key
        , NEWID()
    FROM 
    (
        SELECT CAST(1 + (SELECT COUNT(*) FROM table))*RandMod() AS int) AS rowchoice
        FROM table
    ) AS r
    LEFT OUTER JOIN
    (
        SELECT key
            , ROW_NUMBER() OVER(ORDER BY key) AS rowid
        FROM table
    ) AS t
    ON rowchoice = rowid
    ORDER BY 2