I have two tables. What I want is a random sample from all the possible pairings. Say t1 has 100 rows and t2 has 200, and I want a sample of 300 pairings. The naive way of doing this (run in the online DuckDB shell) is:
CREATE TABLE t1 as FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet' LIMIT 100;
CREATE TABLE t2 as FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/lineitem.parquet' LIMIT 200;
SELECT * FROM t1, t2 USING SAMPLE 300;
But an EXPLAIN reveals that this results in a full cross product, followed by a sample. For actually large tables this is intractable.
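To see the cost concretely, here's a toy Python model of what that plan does (plain lists standing in for the tables, values are just row ids):

```python
import itertools
import random

# Toy stand-ins for t1 and t2; values are just row ids.
t1 = list(range(100))
t2 = list(range(200))

# What the naive plan does: enumerate all 100 * 200 = 20,000
# candidate pairs, then sample 300 of them.
all_pairs = list(itertools.product(t1, t2))
sample = random.sample(all_pairs, 300)
print(len(all_pairs))  # 20000
print(len(sample))     # 300
```

With real table sizes the N * M intermediate is what blows up, not the 300-row result.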
I tried reversing the order of operations, e.g. a shuffle followed by a limit, with
SELECT * FROM (FROM t1 ORDER BY RANDOM()) AS t1, (FROM t2 ORDER BY RANDOM()) AS t2 LIMIT 300;
and this is smarter: according to EXPLAIN it can be done streaming. But this sampling of pairs is biased, because in DuckDB's implementation of the cartesian product (and I'd guess most engines are the same; in fact I think it's required if you want to stream) it works like a nested for loop: the first record in t1 is paired with all the records in t2 before moving on to the second record in t1, etc. So some records in t1 are "oversampled".
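The bias is easy to demonstrate with a toy Python model of the shuffled, streamed cross product (plain lists standing in for the tables):

```python
import itertools
import random

# Toy stand-ins for t1 (100 rows) and t2 (200 rows); values are row ids.
t1 = list(range(100))
t2 = list(range(200))

# Shuffle each input, as ORDER BY RANDOM() does, then stream the cross
# product the way a nested-loop join emits it and keep the first 300 pairs.
random.shuffle(t1)
random.shuffle(t2)
first_300 = list(itertools.islice(itertools.product(t1, t2), 300))

# Only two distinct t1 rows ever appear: the first shuffled t1 row is
# paired with all 200 t2 rows before the second t1 row is touched.
distinct_t1 = {a for a, _ in first_300}
print(len(distinct_t1))  # 2
```

So however the shuffle lands, the 300-pair sample only ever touches two rows of t1, and the first of them appears in two thirds of the pairs.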
So here's my idea: let's sample as many records as we need from each table, in a random order, streamed, and re-sampling the same table over and over again as needed (because we need 300 pairs, and each source table is only length 100 or 200). Then pair these streams together and zip them up. In pseudocode:
WITH
  t1_generator AS (
    (FROM t1 ORDER BY RANDOM())
    UNION ALL
    (FROM t1 ORDER BY RANDOM())
    UNION ALL
    (FROM t1 ORDER BY RANDOM())
    -- etc.
  ),
  t2_generator AS (
    (FROM t2 ORDER BY RANDOM())
    UNION ALL
    (FROM t2 ORDER BY RANDOM())
    UNION ALL
    (FROM t2 ORDER BY RANDOM())
    -- etc.
  ),
  s1 AS (
    SELECT *, ROW_NUMBER() OVER () AS __rn FROM t1_generator
  ),
  s2 AS (
    SELECT *, ROW_NUMBER() OVER () AS __rn FROM t2_generator
  ),
  zipped AS (
    SELECT * EXCLUDE __rn FROM s1 JOIN s2 ON s1.__rn = s2.__rn
  )
SELECT * FROM zipped
LIMIT 300;
Except t1_generator and t2_generator are hardcoded, and you need to calculate how many UNION ALLs you need, which is ugly. It also does the sampling WITHOUT replacement (i.e. round-robin style: it goes through all of t1 before revisiting), when really I want WITH replacement. I tried stuff with recursive CTEs, but those don't work to reshuffle the data on every "round".
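For reference, the with-replacement semantics I'm after look like this in a toy Python model (plain lists standing in for the tables):

```python
import random

# Toy stand-ins for t1 and t2; values are just row ids.
t1 = list(range(100))
t2 = list(range(200))

# Sampling WITH replacement: each of the 300 pairs draws its t1 row and
# its t2 row independently and uniformly, so no row is systematically
# over- or under-represented.
pairs = [(random.choice(t1), random.choice(t2)) for _ in range(300)]
print(len(pairs))  # 300
```

That's the distribution I want; the question is how to express it in SQL without the hardcoded UNION ALL ladder.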
Any thoughts on how to accomplish this? I'm using duckdb if that's important. Thanks!
I'm going to assume you have an index on both tables -- on table A it goes from 1 to N and on table B it goes from 1 to M.
Then you just need to generate 300 pairs of random tuples (R(1..N), R(1..M)). Store that in a table, then join table A to the first column and table B to the second column.
With large tables this would be much faster than sampling from the cross join.
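A sketch of the idea, using Python's sqlite3 purely for illustration (the table names, id columns, and pair counts are made up; the same shape works in DuckDB):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two toy tables with integer keys 1..N and 1..M.
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO a VALUES (?, ?)", [(i, f"a{i}") for i in range(1, 101)])
cur.executemany("INSERT INTO b VALUES (?, ?)", [(i, f"b{i}") for i in range(1, 201)])

# Generate 300 random (a_id, b_id) tuples WITH replacement and store them.
cur.execute("CREATE TABLE pairs (a_id INTEGER, b_id INTEGER)")
cur.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [(random.randint(1, 100), random.randint(1, 200)) for _ in range(300)],
)

# Join each random key pair back to its rows; the engine does 300 index
# lookups instead of touching the 20,000-row cross product.
rows = cur.execute(
    """
    SELECT a.val, b.val
    FROM pairs
    JOIN a ON a.id = pairs.a_id
    JOIN b ON b.id = pairs.b_id
    """
).fetchall()
print(len(rows))  # 300
```

The work is proportional to the sample size, not to N * M, which is why it scales to large tables.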