python gis geospatial geopandas

Draw polygons around a set of points and create clusters in python


I have a pandas DataFrame containing latitude/longitude coordinates. How do I draw non-overlapping polygons around clusters of points and aggregate the geometries in a GeoPandas GeoDataFrame? Below is sample code to work with:

import pandas as pd
import numpy as np
import geopandas as gpd

df = pd.DataFrame({
                   'yr': [2018, 2017, 2018, 2016],
                   'id': [0, 1, 2, 3],
                   'v': [10, 12, 8, 10],
                   'lat': [32.7418248, 32.8340583, 32.8340583, 32.7471895],
                   'lon':[-97.524066, -97.0805484, -97.0805484, -96.9400779]
                 })

# the columns are named 'lon' and 'lat' in the DataFrame above
df = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['lon'], df['lat']))

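# the coordinates are lon/lat in degrees, so declare WGS 84 first,
# then reproject to ESRI:102003 for buffer calculations
df = df.set_crs("EPSG:4326").to_crs("ESRI:102003")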

The polygons can be of any shape, but each must contain at least 5 points. I tried creating a buffer around the points, but a circle is not ideal; I am looking for a way to draw a more flexible polygon.

This polygon representation will be added as a new column to the pandas DataFrame containing the points.

https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.buffer.html



Solution

    • Your question and sample data don't fit together: you ask for clusters of 5 or more points but provide only 4. That leaves whoever answers to find some data. Better practice is to provide an MWE of what you've tried, which can then be developed into the solution you want. I have used UK hospitals as a source of lat / lon data.
    • From your other scattergun questions, it's clear you have tried using geohash as a solution. Let's explore this:
    • get a geohash for each point with geolib.geohash.encode()
    • aggregate points in the same geohash with dissolve(). This gives a MULTIPOINT geometry per geohash; convert it to a POLYGON with convex_hull
    • you now have polygons that do not overlap (each hull stays inside its own geohash cell), each containing a cluster of points. This does not ensure that a cluster has a minimum of 5 points
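
To illustrate why nearby points end up with the same geohash, here is a minimal pure-Python sketch of the standard geohash encoding: the longitude and latitude ranges are halved alternately, and the resulting bits are packed five at a time into base-32 characters. The function name geohash_encode is hypothetical and this is not geolib's source; it only mirrors the published algorithm.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=3):
    """Encode a lat/lon pair (degrees) as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    value = 0        # bits collected for the current character
    bits = 0         # how many bits are in `value`
    use_lon = True   # the first bit refines longitude, then they alternate

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(_BASE32[value])
            value = 0
            bits = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, 3))  # u4p
```

A shorter hash is always a prefix of a longer one for the same point, so the precision controls the cell size: 3 characters give cells about 1.4 degrees on each side, and every point inside one cell gets the same label.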


    import requests, io
    import pandas as pd
    import numpy as np
    import geopandas as gpd
    import geolib.geohash
    import folium
    
    # get some data shaped like the sample, with enough points to cluster
    df = (
        pd.read_csv(
            io.StringIO(requests.get("https://assets.nhs.uk/data/foi/Hospital.csv").text),
            sep="¬",
            engine="python",
        )
        .rename(columns={"Latitude": "lat", "Longitude": "lon"})
        .loc[:, ["lat", "lon"]]
    ).dropna()
    df["id"] = df.index
    df["yr"] = np.random.choice(range(2016, 2019), len(df))
    df["v"] = np.random.randint(0, 11, len(df))
    
    # get geohash so points in same area can be clustered
    # (geolib.geohash.encode takes latitude first, then longitude, then precision)
    df["geohash"] = df.apply(lambda r: geolib.geohash.encode(r["lat"], r["lon"], 3), axis=1)
    
    # construct geodataframe
    gdf = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df["lon"], df["lat"]), crs="epsg:4326"
    )
    # cluster points to polygons
    gdf2 = gdf.dissolve(by="geohash", aggfunc={"v": "sum", "id":"count", "yr":"mean"})
    gdf2["geometry"] = gdf2["geometry"].convex_hull
    
    # let's visualise everything
    m = gdf2.explore(color="green", name="cluster", height=300, width=600)
    m = gdf.explore(column="geohash", m=m, name="points")
    folium.LayerControl().add_to(m)
    m
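
The answer's last bullet notes that nothing enforces the 5-point minimum. One possible way to handle that, sketched here in plain Python instead of GeoPandas, is to group points by cluster label, drop any group with fewer than min_points points, and take the convex hull of what remains. The names convex_hull and hulls_for_large_clusters are hypothetical, the hull uses Andrew's monotone chain as a stand-in for what GeoPandas' convex_hull computes, and dropping small clusters (instead of merging them into a neighbouring cluster) is an assumption.

```python
from collections import defaultdict


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 when o -> a -> b turns counter-clockwise
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # the last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


def hulls_for_large_clusters(labelled_points, min_points=5):
    """Map each label with at least `min_points` points to its convex hull.

    labelled_points: iterable of (label, (x, y)) pairs, e.g. (geohash, (lon, lat)).
    Duplicate coordinates count towards the minimum.
    """
    groups = defaultdict(list)
    for label, xy in labelled_points:
        groups[label].append(xy)
    return {
        label: convex_hull(pts)
        for label, pts in groups.items()
        if len(pts) >= min_points
    }


sample = [
    ("a", (0, 0)), ("a", (2, 0)), ("a", (2, 2)), ("a", (0, 2)), ("a", (1, 1)),
    ("b", (5, 5)), ("b", (6, 6)),  # only two points, so "b" is dropped
]
print(hulls_for_large_clusters(sample))
```

With the GeoDataFrame built above, where dissolve() stores the per-cluster point count in the id column, the equivalent filter should be gdf2[gdf2["id"] >= 5].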