Search code examples
sqlpostgresqlrandomforeign-keysdata-generation

PostgreSQL -- how to generate random number of rows of fake data with foreign key constraints?


Note: I tried searching for other questions and answers that address this, but I couldn't find one that really corresponded to my problem and presented a full solution.

I'm trying to generate random, synthetic data to test my database schema, using SQL. While generating a bunch of random values in PostgreSQL is easy enough with random(), generating random data sets that preserve the constraints and characteristics of the data I'm expecting to see is not. Specifically, I have the following tables:

CREATE TABLE suites(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);

INSERT INTO suites(name)
SELECT 'suite' || g FROM generate_series(1,50) g;

CREATE TABLE tests(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);

INSERT INTO tests(name)
SELECT 'test' || g FROM generate_series(1,100) g;

CREATE TABLE tests_in_suites(
suite_id BIGINT,
test_id BIGINT,
PRIMARY KEY (suite_id, test_id)
);

DB Fiddle

I want to populate tests_in_suites with random values, such that each suite contains a random number (between 3 and 7) of tests, selected uniformly from tests. I want the selection to be random and uniform, and avoid cycles and other repeating patterns. I have tried the following approach:

SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5)) t
ORDER BY s.id, t.id;

DB Fiddle

But it always chooses the same number of tests for each suite and the tests selected are identical, because the optimiser replaces the subquery for s with a constant. I tried introducing a dependency on the current suite being considered, but it complains the value I'm trying to use is not accessible:

SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
ORDER BY s.id, t.id;

ERROR:  invalid reference to FROM-clause entry for table "s"
LINE 3: ...s ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
                                                             ^
HINT:  There is an entry for table "s", but it cannot be referenced from this part of the query.

DB Fiddle

How can I generate my random data without falling victim to either the optimiser or invalid data dependencies in my query?


Solution

  • I want to populate tests_in_suites with random values, such that each suite contains a random number (between 3 and 7) of tests, selected uniformly from tests

    This sounds like a nice use case for a lateral join...

    INSERT INTO tests_in_suites(suite_id,test_id)
    SELECT suites.id, t.id
    FROM suites
    CROSS JOIN LATERAL (SELECT id, suites.id AS lol FROM tests ORDER BY random() LIMIT (3+random()*4)) t;
    

    Lateral join recomputes the joined table for each row of the table on the left side of the join, which is what we want here. But postgres will optimize this away if the joined table subquery looks like it is not a dependent subquery after all. So I added suites.id in the joined table to make it look like the joined table is indeed dependent on the row from table suites.

    There's probably a way to do it with array_agg() and unnest() too.