Search code examples
postgresqlrandommockinglateral-join

Selecting random IDs from another table....confused about LATERAL JOIN


I'm trying to get a handle on generating random data in Postgres, and find I'm misunderstanding something about LATERAL JOIN. Building on some help I got earlier, I've got some code that's trying to:

-- Generate a series of numbers
-- Generate timestamps on the hour matching the number sequence
-- Generate a random(ish) score in a roughly normal distribution for each row
-- Pick an ID randomly from a table in the database.

This last bit is not working. When I run the script shown below, I get a random value for the facility_id, but every row has the same random value. I'd like the random ID assigned on each row, not once globally for the whole run. In procedural thinking, the facility_id is being assigned before the loop, and I want it assigned in the loop. I thought that LATERAL JOIN would help me out here, but

WITH facilities_count AS
(SELECT count(*) from facility)

 SELECT hour_number AS id, -- Get numbers in sequence.
       '2019-01-01 00:00'::timestamp + interval '1 HOUR' * hour_number AS stamp, -- Get hours in sequence
        ABS(TRUNC(normal_rand(1, 0, 1) * 100)) AS score, -- Create a random score in a ~normal distribution.
       random_facility.id

 FROM (SELECT * FROM generate_series(1,8760,1) AS hour_number) generated_numbers 

 LEFT JOIN LATERAL
      (SELECT id
          FROM facility 
        OFFSET floor(random() * (select count from facilities_count))
         LIMIT 1) random_facility
            ON true;

I thought that a subquery might work, but I also get a single value across all rows for the facility_id with this code:

WITH facilities_counter AS
(SELECT count(*) from facility)

 SELECT hour_number AS id, -- Get numbers in sequence.
       '2019-01-01 00:00'::timestamp + interval '1 HOUR' * hour_number AS stamp, -- Get hours in sequence
        ABS(TRUNC(normal_rand(1, 0, 1) * 100)) AS score, -- Create a random score in a ~normal distribution.
         (SELECT id FROM facility OFFSET floor(random() * (select count from facilities_counter)) LIMIT 1)

 FROM (SELECT * FROM generate_series(1,8760,1) AS hour_number) generated_numbers;

I haven't listed the facility table definition, but the only field that matters above is id, so any table would work the same way.

In case it makes any difference to the answer, once I can figure out the solution to this problem, I'd like to then use the random facility_id on each row as an input to select something else out of another table.

Thanks for any help. I'm working on this not just to get the solution, but to try and get a better mental model of how the various tools work. I'm (obviously) not at the point where I can read code like the above and predict in my head how it will behave. Getting to that understanding is kind of fundamental to figuring out how to solve problems on my own. Put another way, I'm not only trying to solve this problem, I'm trying to reduce my mental gaps generally.

Premature Optimization

I'm editing my question as I need to post more code.

Ha! Glad to see I'm not the only one accused of having premature optimization problems ;-) That's something that can be hurled in anger in a design or review session, to be sure.

I think what you're showing is that the id is used in an expression, which makes its outcome indeterminate. And here's where I live and relearn that I I should post my table structure. Our id fields are UUIDs. I've attempted to hack together something that would use the UUID in an expression, but it does not change the behavior.

where left(id::text,1) <> 'X' -- prevent premature optimization

The premature optimization you describe, and the workaround to defeating it, are not behaviors I could predict. Because gaps. I've tried running my code with the left expression through explain (costs off), and I get this kind of output:

  CTE facilities_counter
    ->  Aggregate
          ->  Seq Scan on facility
  InitPlan 3 (returns $2)
    ->  Limit
          InitPlan 2 (returns $1)
            ->  CTE Scan on facilities_counter
          ->  Seq Scan on facility facility_1
                Filter: ("left"((id)::text, 1) <> 'X'::text)
  ->  ProjectSet
        ->  Function Scan on generate_series hour_number```

I'm unclear from the output how I would distinguish the behavior you describe, which is still present here (?)

Solution

  • This is a premature optimization problem. Postgres forsees that the subquery would return a constant value and optimizes it away.

    A solution is to force it to execute the subquery for each record, by adding some condition that will always evaluate as true but that Postgres will not understand as such.

    Consider:

    with facilities_counter as (select count(*) from facility)
    select 
        hour_number as id, -- get numbers in sequence.
        '2019-01-01 00:00'::timestamp + interval '1 hour' * hour_number as stamp,
        abs(trunc(normal_rand(1, 0, 1) * 100)) as score,
        (
            select id 
            from facility 
            where id <> -1 * hour_number   -- prevent premature optimization
            offset floor( random() * (select count from facilities_counter)) 
            limit 1
        )
    from (
        select * from generate_series(1,8760,1) as hour_number
    ) generated_numbers;
    

    Demo on DB Fiddle