Search code examples
postgresqlgenerate

Generate data with at least one occurence


I have three tables:

create table genres
(
genre_id serial primary key,
genre_name varchar NOT NULL UNIQUE
);

create table movies
(
movie_id serial primary key,
movie_name varchar NOT NULL           
);

create table movie_genres
(
movie_id integer references movies NOT NULL,
genre_id integer references genres NOT NULL, 
PRIMARY KEY(movie_id, genre_id)
);

Tables genres and movies are full of data and I want to generate some random data for table movie_genres, so that every movie has at least one genre. I tried it this way, but then it is possible for a movie to be without any genre. Can anyone help me with that, please?

insert into movie_genres 
select movie_id, genre_id 
from genres cross join movies 
where random() < 0.15;

Solution

  • Hmm, you can try to join a derived table in which you first select one random genre and then UNION some more randomly.

    INSERT INTO movie_genres 
            (movie_id,
             genre_id)
    SELECT m.movie_id,
           rg.genre_id
           FROM movies m
                CROSS JOIN ((SELECT g.genre_id
                                    FROM genres g
                                    ORDER BY random()
                                    LIMIT 1)
                            UNION
                            (SELECT g.genre_id
                                    FROM genres g
                                    WHERE random() < 0.15)) rg;
    

    That however means that every movie has that one genre selected first. To overcome this and have the first genre be random per movie, a lateral join can be used. (Remark: You need to use some column from the outer table in the derived table as otherwise the optimizer seems to optimize the LATERAL away.)

    INSERT INTO movie_genres 
                (movie_id,
                 genre_id)
    SELECT rg.movie_id,
           rg.genre_id
           FROM movies m
                CROSS JOIN LATERAL ((SELECT g.genre_id,
                                            m.movie_id -- that's just here to force the optimizer to keep the join lateral
                                            FROM genres g
                                            ORDER BY random()
                                            LIMIT 1)
                                    UNION
                                    (SELECT g.genre_id,
                                            m.movie_id
                                            FROM genres g
                                            WHERE random() < 0.15)) rg;
    

    db<>fiddle