Search code examples
sqlsql-serversql-server-2014correlated-subquery

SQL Server - Correlation with random values?


Simplified example of my situation:

I have a table with three columns: ID, CATEGORY, and TIMESTAMP. Each ID and TIMESTAMP is unique, but CATEGORY is not unique.

I made this query to return a pseudo-random list of rows in the table (one per CATEGORY).

SELECT b.*
FROM
(
  SELECT MIN(RAND(ID)*100000-FLOOR(RAND(ID)*100000)) [RandomID] -- Select random identifier for each row
  FROM MYTABLE
  GROUP BY CATEGORY
) a
INNER JOIN
MYTABLE b
ON a.RandomID = (RAND(b.ID)*100000-FLOOR(RAND(b.ID)*100000))

It seems to be working fine, but my concern is that it is possible for two (or more) different ID's to correspond to the same RandomID. If this happens, then the query would return inaccurate results, since the tables are being JOINed based on the RandomID.

Is this a valid concern? If so, how can it be overcome?


P.S. Some context:

In my case, the result of this query will be used to either keep or purge some records and files on a monthly basis, so the accuracy of the query result is very important.

To be clear, it doesn't matter to me which rows are selected, as long as there is guaranteed to be one for each CATEGORY, with one condition: I want the selected rows to be "roughly" evenly distributed based on TIMESTAMP. This is why I am correlating the data from random values rather than from the TIMESTAMP. (E.g. correlating via MIN(TIMESTAMP) would result in a higher density of rows at the beginning of the month.) Considering I have thousands of categories per month, picking the rows pseudo-randomly should generally lead to evenly distributed TIMESTAMPS (which is my goal).

One more note: I want the query to be repeatable (i.e. I want it to always select the same pseudo-random values.) Therefore, solutions involving things such as newid() are not adequate.


As requested, here is sample data.

TIMESTAMP       | ID | CATEGORY
-------------------------------
10/21/19 1:00AM | 1  | A
10/21/19 2:00AM | 2  | B
10/21/19 3:00AM | 3  | A
10/21/19 4:00AM | 4  | B
10/21/19 5:00AM | 5  | A
10/21/19 6:00AM | 6  | B

One possible output (depends on exactly what RAND() chooses) is:

TIMESTAMP       | ID | CATEGORY
-------------------------------
10/21/19 3:00AM | 3  | A
10/21/19 6:00AM | 6  | B

It doesn't particularly matter which rows are chosen, as long as there is one for each category. Again, I don't want to correlate based on TIMESTAMP because this guarantees that I would choose the first two rows in the sample data, but I want to have the TIMESTAMPS roughly evenly distributed.


Solution

  • Performance can be a problem with this method.

    declare @mytable table (timestamp datetime, ID int, category varchar(150))
    
    insert into @mytable
    values ('10/21/19 1:00AM', 1, 'A'),
        ('10/21/19 2:00AM', 2, 'B'),
        ('10/21/19 3:00AM', 3, 'A'),
        ('10/21/19 4:00AM', 4, 'B'),
        ('10/21/19 5:00AM', 5, 'A'),
        ('10/21/19 6:00AM', 6, 'A'),
        ('10/21/19 7:00AM', 7, 'A'),
        ('10/21/19 8:00AM', 8, 'A'),
        ('10/21/19 9:00AM', 9, 'A'),
        ('10/21/19 10:00AM', 10, 'A'),
        ('10/21/19 11:00AM', 11, 'A'),
        ('10/21/19 12:00AM', 12, 'A'),
        ('10/21/19 1:00PM', 13, 'A'),
        ('10/21/19 2:00PM', 14, 'A'),
        ('10/21/19 3:00PM', 15, 'A'),
        ('10/21/19 4:00PM', 16, 'A'),
        ('10/21/19 5:00PM', 17, 'A'),
        ('10/21/19 6:00PM', 18, 'A'),
        ('10/21/19 7:00PM', 19, 'A'),
        ('10/21/19 8:00PM', 20, 'A'),
        ('10/21/19 6:00PM', 21, 'B')
    
    select timestamp, id, category
    from (
       select *, row_number() over (partition by category order by newid()) rown
       from @mytable
    ) a
    where rown=1
    

    I think you could also use your random code. I don't know how the 2 methods would compare for distribution. EDIT I added ID to the order. This makes the result repeatable even in the (very) unlikely event that the random code has a collision.

    ...
    select timestamp, id, category
    from (
       select *, row_number() over (partition by category order by RAND(ID)*100000-FLOOR(RAND(ID)*100000),ID) rown
       from @mytable
    ) a
    where rown=1