Search code examples
sqlpostgresqlpostgisgreatest-n-per-grouppoint-in-polygon

Assign point with largest population from set of points contained in a polygon


I'm trying to add a "population-based-centroid" column to a series of U.S. county polygons, where the location is not based on the geographic centroid of the polygon, but rather on the location of the geonames populated place with the largest population. For example, I want to assign the geometry of the arrow-indicated point (point diameter = population) to the selected polygon's population-based-centroid column:

enter image description here

I've tested this query, and it returns the correct geometry for any given polygon (Boston's Suffolk County, for example):

SELECT g1.the_geom
FROM counties c1
JOIN geonames g1
ON ST_Contains(c1.the_geom, g1.the_geom)
WHERE c1.name = 'Suffolk County, MA'
ORDER BY g1.population DESC
LIMIT 1;

However, I'm dealing with ~4000 polygons, and when I try to use the query in an UPDATE function like this it hangs indefinitely (or at least far longer than it should for this number of features):

UPDATE counties
    SET the_geom_popcentroid = (
        SELECT g1.the_geom
        FROM counties c1
        JOIN geonames g1
        ON ST_Contains(c1.the_geom, g1.the_geom)
        ORDER BY g1.population DESC
        LIMIT 1
    );

Where have I nested this UPDATE function incorrectly?


Solution

  • It helps to use a window function here:

     WITH max_pop as (
      SELECT DISTINCT c.id as county_id,
        first_value(g.the_geom) OVER (PARTITION BY c.name ORDER BY g.population DESC) as the_geom
      FROM counties c
      JOIN geonames g
        ON ST_Intersects(c.the_geom,g.the_geom)
      )
    UPDATE counties
    SET pop_center_geom=max_pop.the_geom
    FROM max_pop
    WHERE counties.id=max_pop.county_id;
    

    What we're doing:

    Ordering the cities in each county by population descending, then taking the geometry of the first one and the id of the county in which it resides.

    We then update the county table using the id and the geometry we got.

    I prefer this to DISTINCT ON method mentioned because to me it's more explicit about what's happening and relies less on "side-effects" (for lack of a better word).