Search code examples
sqlpostgresqlpostgis

query to get the results of each row in table1 with a subquery of N maximum records found to meet a condition in table2


I am trying without success to calculate building heights in my city using the LIDAR satellite dataset.

System specs

  • CPU: Core i7 6700k 4200MHz, 4 cores, 8 threads
  • RAM: 32GB DDR4 3200mhz
  • SSD: 1TB Samsung 970 EVO
  • OS: Ubuntu 18.04

Postgres setup

I am using the latest version of Postgres v12.1 database with PostGIS with the following tweaks recommended in different sources:

 shared_buffers = 256MB 
 maintenance_work_mem = 4GB
 max_parallel_maintenance_workers = 7
 max_parallel_workers = 7
 max_wal_size = 60GB
 min_wal_size = 90MB
 random_page_cost = 1.0

Database setup

  • In the lidar table I have more than 3000 million rows, and in the buildings table more than 150000 rows.
  • In the lidar table the GiST index was created: CREATE INDEX lidar_idx ON lidar USING GIST (geom);
  • building table: | gid | geom |
  • lidar table: | z | geom |

Height calculation

Currently in order to calculate the height of a building, it is necessary to check if each one of the 3000 million points (rows) is inside the area of each building and calculate the average of all the points found inside a building area.

The queries I have tried are taking forever (probably more than 5 days or even more) and I would like to simplify the query so that I can get the height of the building with a lot less points, without having to compare with all the insane 3000 million records each time for each building.

In example:

  • For building with id1, I would like to get only the first 100 records found which are inside the building geometry area ( ST_Within(l.geom, e.geom) ), and once those 100 records are found, pass to the next building.
  • For building with id2, I would like the same, get only the first 100 records found which are inside the building area.
  • And so on..

My main query is

SELECT e.gid, AVG(l.z) AS height 
FROM lidar l, 
     buildings e 
WHERE ST_Within(l.geom, e.geom) 
GROUP BY e.gid) t

I have tried with another query, but I can not get it to work.

SELECT e.gid, AVG(l.z), COUNT(1) FILTER (WHERE ST_Within(l.geom, e.geom)) AS gidc
FROM lidar l, buildings e
WHERE gidc < 100
GROUP BY e.gid

Solution

  • I don't think you really want to do this at all. You should first try to make the correct query faster rather than compromising correctness by working with an arbitrary (but not random) subset of the data.

    But if you do want it, then you can use a lateral join.

    SELECT e.gid from 
        buildings e cross join lateral 
        (select AVG(l.z) AS height FROM lidar l WHERE ST_Within(l.geom, e.geom) LIMIT 100) 
    

    it is necessary to check if each one of the 3000million points (rows) is inside the area of each building and calculate the average of all the points found inside a building area.

    This is exactly what a geometry index is for. You don't need to look at every point to get just the ones inside the a building area. If you don't have the right index, such as on lidar using gist (geom), then the lateral join query will also be awful.