Is there any way in SQL to take a random sample of N rows (or M% if necessary) rows from a linked external data source, using the Azure Synapse Analytics serverless SQL pool?
Cryptographic functions are not available in the serverless SQL pool, so basically I can't use RAND()
or CHECKSUM(NEWID())
, e.g. to define a condition to filter rows by comparing to my desired value of N (or M):
FUNCTION 'RAND' is not supported.
I can use HASHBYTES
against a field in my external data source, but my table doesn't contain a unique ID per row.
SELECT title, director, HASHBYTES('sha2_256', title)
FROM external_table
Ideally the sampling should be statistically sound, rather than relying on some implementation detail of the database (like TABLESAMPLE
would, which anyway is also not available in the serverless pool).
Expanding on the comment from @GregGalloway
The query would look something like this:
WITH [cte_sample] as (SELECT
[title],
[director],
ROW_NUMBER() OVER (ORDER BY [ProductKey]) % 20 as [Sample_Group]
FROM
[external_table])
SELECT
[title],
[director]
FROM
[cte_sample]
WHERE
[Sample_Group] = 1
Used % 20 to give a 5% sample size.