I hope you all can help me.
What I'm trying to do
I'm trying to take a random sample from a large Azure database so that I can run the files locally before deploying on Azure cloud. The steps are to first create a pseudorandom number, then just take the first X rows or first X% of rows.
What I've tried
I've read several posts including stack overflow SQL, stack overflow SQL 2, SQL select, and USQL Order By Fetch, but still have not figured the syntax out.
Code:
//ATTEMPT 1
SELECT * FROM @searchlog
FETCH FIRST 3 ROWS ONLY;
//ATTEMPT 2
@outsearchlog =
SELECT *
FROM @searchlog
ORDER BY NEWID() DESC FETCH 10;
//ATTEMPT 3
@outsearchlog =
SELECT *,
NEWID() AS newid
FROM @searchlog;
//ATTEMPT 4
@outsearchlog =
SELECT *,
newid() AS newid
FROM @searchlog;
//ATTEMPT 5
@outsearchlog =
SELECT *,
newid() AS newidwoot
FROM @searchlog;
//ATTEMPT 6
@outsearchlog =
SELECT *,
Random() AS newidwoot
FROM @searchlog;
U-SQL has a SAMPLE
operator so just add it to the bottom of your statement. For example, this code generates a 10% uniform sample:
@outsearchlog =
SELECT *
FROM @searchlog
SAMPLE UNIFORM (0.1);
Sample is documented here. You do not need an additional row number if you use this approach. If you do feel you need a row number then use ROW_NUMBER()
.
I would question the use of U-SQL in the year 2021, particularly for new projects. ADLS Gen 1 has been deprecated (retiring Feb 2024) and as you know, U-SQL is not compatible with ADLS Gen 2. So you should endeavour to convert your projects to use Azure Databricks or Azure Synapse Analytics or other suitable platform as soon as possible.
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
If you do consider switching to Azure SQL DB or Azure Synapse Analytics then that supports the TABLESAMPLE
command for similar effect:
SELECT *
FROM yourTable
TABLESAMPLE( 10 PERCENT );