Search code examples
c#sqlazureazure-storageu-sql

U-SQL random sample of rows


I hope you all can help me.

What I'm trying to do

I'm trying to take a random sample from a large Azure database so that I can run the files locally before deploying on Azure cloud. The steps are to first create a pseudorandom number, then just take the first X rows or first X% of rows.

What I've tried

I've read several posts including stack overflow SQL, stack overflow SQL 2, SQL select, and USQL Order By Fetch, but still have not figured the syntax out.

Code:

//ATTEMPT 1
SELECT * FROM @searchlog
FETCH FIRST 3 ROWS ONLY;
//ATTEMPT 2
@outsearchlog =
    SELECT *
    FROM @searchlog
    ORDER BY NEWID() DESC FETCH 10;
//ATTEMPT 3
@outsearchlog =
    SELECT *,
           NEWID() AS newid
    FROM @searchlog;
//ATTEMPT 4
@outsearchlog =
    SELECT *,
           newid() AS newid
    FROM @searchlog;
//ATTEMPT 5
@outsearchlog =
    SELECT *,
           newid() AS newidwoot           
    FROM @searchlog;
//ATTEMPT 6
@outsearchlog =
    SELECT *,
           Random() AS newidwoot           
    FROM @searchlog;

Solution

  • U-SQL has a SAMPLE operator so just add it to the bottom of your statement. For example, this code generates a 10% uniform sample:

    @outsearchlog =
        SELECT *          
        FROM @searchlog
        SAMPLE UNIFORM (0.1);
    

    Sample is documented here. You do not need an additional row number if you use this approach. If you do feel you need a row number then use ROW_NUMBER().

    I would question the use of U-SQL in the year 2021, particularly for new projects. ADLS Gen 1 has been deprecated (retiring Feb 2024) and as you know, U-SQL is not compatible with ADLS Gen 2. So you should endeavour to convert your projects to use Azure Databricks or Azure Synapse Analytics or other suitable platform as soon as possible.

    U-SQL retired

    https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview

    If you do consider switching to Azure SQL DB or Azure Synapse Analytics then that supports the TABLESAMPLE command for similar effect:

    SELECT *
    FROM yourTable
    TABLESAMPLE( 10 PERCENT );