I'm somewhat compute-constrained, and I don't want to hammer the database or saturate our bandwidth by pulling data for hours.
My current query is something like this:
SELECT TOP (100000000) mycolumns
FROM mydb
WHERE lower_control_limit > value1
  AND upper_control_limit < value2;
This is not giving me a decent random sample (TOP without an ORDER BY just returns whichever rows the engine happens to read first), but I'm also afraid of running anything that has to traverse billions of rows before returning values. What I need is a uniform random sample over the rows.
What is the "best practice" for pulling a few hundred million rows from a store that is many billions of rows in size?
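The only cheap built-in option I'm aware of is TABLESAMPLE, but my understanding is that it samples data pages rather than individual rows, so I'm not sure it counts as uniform over rows. A sketch of what I mean (the 2 PERCENT is just a placeholder for whatever fraction would get me to roughly a hundred million rows):

SELECT mycolumns
FROM mydb TABLESAMPLE SYSTEM (2 PERCENT)
WHERE lower_control_limit > value1
  AND upper_control_limit < value2;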
I've seen the ORDER BY NEWID() approach, but I'm worried about the cost of that query given my limited compute and bandwidth.
Found this: Select n random rows from SQL Server table
SELECT TOP (100000000) mycolumns
FROM mydb
ORDER BY NEWID();
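To avoid sorting billions of rows just to discard most of them, a variant I'm considering is a per-row random filter instead of the full sort. This is only a sketch (the 2 % threshold is again a placeholder, and I haven't tried it at this scale):

SELECT mycolumns
FROM mydb
WHERE lower_control_limit > value1
  AND upper_control_limit < value2
  -- NEWID() is evaluated per row, so each row is kept with roughly 2% probability
  AND ABS(CHECKSUM(NEWID())) % 100 < 2;

As I understand it, this still scans the table once but avoids the huge sort that ORDER BY NEWID() would require; I'm not sure whether that's the right trade-off at this scale.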