I'm trying to add a "population-based-centroid" column to a series of U.S. county polygons, where the location is not based on the geographic centroid of the polygon, but rather on the location of the geonames populated place with the largest population. For example, I want to assign the geometry of the arrow-indicated point (point diameter = population) to the selected polygon's population-based-centroid column:
I've tested this query, and it returns the correct geometry for any given polygon (Boston's Suffolk County, for example):
SELECT g1.the_geom
FROM counties c1
JOIN geonames g1
ON ST_Contains(c1.the_geom, g1.the_geom)
WHERE c1.name = 'Suffolk County, MA'
ORDER BY g1.population DESC
LIMIT 1;
However, I'm dealing with ~4000 polygons, and when I try to use the query in an UPDATE function like this it hangs indefinitely (or at least far longer than it should for this number of features):
UPDATE counties
SET the_geom_popcentroid = (
SELECT g1.the_geom
FROM counties c1
JOIN geonames g1
ON ST_Contains(c1.the_geom, g1.the_geom)
ORDER BY g1.population DESC
LIMIT 1
);
Where have I nested this UPDATE function incorrectly?
It helps to use a window function here:
WITH max_pop as (
SELECT DISTINCT c.id as county_id,
first_value(g.the_geom) OVER (PARTITION BY c.name ORDER BY g.population DESC) as the_geom
FROM counties c
JOIN geonames g
ON ST_Intersects(c.the_geom,g.the_geom)
)
UPDATE counties
SET pop_center_geom=max_pop.the_geom
FROM max_pop
WHERE counties.id=max_pop.county_id;
What we're doing:
Ordering the cities in each county by population descending, then taking the geometry of the first one and the id of the county in which it resides.
We then update the county table using the id and the geometry we got.
I prefer this to DISTINCT ON method mentioned because to me it's more explicit about what's happening and relies less on "side-effects" (for lack of a better word).