What is the best query to sample from Impala for a huge database?

I have a huge table with more than 1 billion rows in Impala. I need to sample approximately 100,000 rows multiple times. What is the best way to query these sample rows?

Solution

As Jeff mentioned, what you've asked for exactly isn't possible yet, but we do have an internal aggregate function which takes 200,000 samples (using reservoir sampling) and returns the samples, comma-delimited as a single row. There is no way to change the number of samples yet. If there are fewer than 200,000 rows, all will be returned. If you're interested in how this works, see the implementation of the aggregate function and reservoir sampling structures.

There isn't a way to 'split' or explode the results yet, either, so I don't know how helpful this will be.

For example, sampling trivially from a table with 8 rows:

> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id)             |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s

(For context: this was added in a past release to support histogram statistics in the planner, which unfortunately isn't ready yet.)