Search code examples
sqlpostgresqlcluster-analysispostgis

Clustering geometries recursively exceeds cluster size limit


I want each cluster to have a maximum of 20 items. Here is my code in PostgreSQL with PostGIS extension:

WITH RECURSIVE clustered_data AS (-- Step 1: Perform initial clustering
    SELECT pma.*
         , ST_ClusterKMeans(ST_Transform("centerPoint", 24313), 1, 20000) 
             OVER (PARTITION BY "areaID" ORDER BY "ID") AS "nta_cluster_id" -- Initial clustering
         , 1 AS "recursion_depth" -- Initialize recursion depth
    FROM "potential_missed_areas" pma 
    WHERE pma."campaignid" = -1
      AND pma."status" = 'approved'
),cluster_sizes AS (-- Step 2: Calculate cluster sizes
    SELECT "nta_cluster_id"
         , COUNT(*) AS "cluster_size"
    FROM clustered_data
    GROUP BY "nta_cluster_id"
),recursive_clusters AS (-- Base case: Start with clusters from previous CTE
    SELECT cd.*--column list skipped for brevity
         , cd."nta_cluster_id" -- Include nta_cluster_id
         , cs."cluster_size" -- Include cluster size from the previous part
         , cd."recursion_depth" -- Include recursion_depth from base case
    FROM clustered_data cd
    JOIN cluster_sizes cs ON cd."nta_cluster_id" = cs."nta_cluster_id"
    UNION ALL-- Recursive case: Re-cluster if any cluster has more than 20 items
    SELECT rc.* --column list skipped for brevity
         , rc."nta_cluster_id" -- Include nta_cluster_id
         , cs."cluster_size" -- Include cluster size from the previous part
         , rc."recursion_depth" + 1 AS "recursion_depth" -- Increment recursion depth
    FROM recursive_clusters rc
    JOIN cluster_sizes cs ON rc."nta_cluster_id" = cs."nta_cluster_id"
    WHERE cs."cluster_size" > 20 -- Only re-cluster if there are more than 20 items
      AND rc."recursion_depth" < 100 -- Limit recursion depth to avoid infinite loops
),dataset as (-- Final selection
SELECT *
     , ST_ClusterKMeans(ST_Transform("centerPoint", 24313), GREATEST(1, CEIL("cluster_size" / 20.0)::int)) 
         OVER (PARTITION BY "areaID") AS "new_cluster_id" -- Re-cluster based on size
FROM recursive_clusters
WHERE "recursion_depth" <= 100
)SELECT "areaID"
      , new_cluster_id
      , count(*)
 FROM dataset
 GROUP BY 1,2;

I observed that the counts for the majority of the rows are significantly greater than 20, specifically returning values like 900, 100, and 200.

This discrepancy raises some concerns, as such high counts may indicate potential issues within the recursive logic utilized in the query. I’m particularly interested in delving deeper into whether this recursive approach is functioning as intended or if it might be introducing inaccuracies into the results.

Additionally, I would appreciate insights on alternative methods or strategies that could achieve the same objectives, potentially yielding more reliable and accurate counts.


Solution

    1. order by in a window function call implies range between unbounded preceding and current row frame. That means each of the shapes only sees itself and shapes with lower ID. Even if that doesn't cause problems, it needlessly complicates the operation. Remove the order by from the window definition.
    2. Your recursive query does zero recursion. It just repeats all clusters above the size limit, 98 times, incrementing their recursion_depth and that's it - you could've done that without recursive, joining to generate_series(2,99,1).
    3. On-the-fly ST_Transform() slows things down and disables index support.

    My bet is you overlooked that in point 2 above, you multiplied your set by cloning the same shapes in the large clusters, over and over, 98 times, for no obvious reason. It'll come out larger than it came in, and overall count(*) as well as individual cluster size will go up accordingly.

    You can count(*) your input and compare that to a total count(*) on your output. In this test I generated 1005 points within the boundaries of SRID 24313 (in and around Pakistan) and your query turned those 1005 points into 95847.

    Here's what I think you tried to do:
    demo at db<>fiddle

    DROP TABLE IF EXISTS clustered_points;
    CREATE TABLE clustered_points AS
    WITH clustered_data AS (-- Step 1: Perform initial clustering
        SELECT pma.*
             , ST_ClusterKMeans(ST_Transform("centerPoint", 24313), 1, 2e4) 
               OVER (PARTITION BY "areaID") AS "nta_cluster_id" -- Initial clustering
        FROM "potential_missed_areas" pma 
        WHERE pma."campaignid" = -1
          AND pma."status" = 'approved'
    ),cluster_sizes AS (-- Step 2: Calculate cluster sizes
        SELECT "areaID"
             , "nta_cluster_id"
             , COUNT(*)::int AS "cluster_size"
        FROM clustered_data
        GROUP BY 1,2
    ),dataset as (-- Final selection
    SELECT *
         , ST_ClusterKMeans( ST_Transform("centerPoint", 24313)
                            ,GREATEST(1,CEIL("cluster_size" / 20.0))::int) 
             OVER (PARTITION BY "areaID", "nta_cluster_id") AS "new_cluster_id" -- Re-cluster based on size
    FROM clustered_data 
    JOIN cluster_sizes USING("areaID","nta_cluster_id")
    )SELECT *
          , format( '"areaID":%s,"nta_cluster_id":%s,"new_cluster_id":%s'
                   ,"areaID","nta_cluster_id","new_cluster_id") as "full_cluster_identifier"
         , rank()over(order by "areaID","nta_cluster_id","new_cluster_id") as "final_cluster_id"
    FROM dataset;
    

    Also useful for inspection:

    drop table if exists cluster_shapes;
    create table cluster_shapes as 
    select min("full_cluster_identifier") as "full_cluster_identifier"
         , min("final_cluster_id") as "final_cluster_id"
         , min("cluster_size") as "cluster_size"
         --, st_optimalalphashape(st_collect("centerPoint")) as geom
         , st_convexhull(st_collect("centerPoint")) as geom
    from clustered_points
    group by "final_cluster_id"
    having min("cluster_size")>5;
    

    This doesn't multiply the set and performs your two-phase cluster correctly. What I think you overlooked is that you partitioned by "areaID" in the initial step, but then you forgot to involve it in the group by when counting the sizes in cluster_sizes. This lead to different clusters in different areas being counted together because their numbers are re-used in different areas.

    Cluster 1 in area 1 has the same number, but isn't related to cluster 1 in area 5. If you only group by "nta_cluster_id", it'll mix them together and the next step will see incorrect sizes.

    Here are some clusters formed using that on 200k random points: enter image description here On a surface area this large with uniformally distributed points, 20km radius is too short to form sizeable clusters, and I really needed a lot on input for even a few to connect.