Search code examples
pythonnetworkxspatialgeopandas

NetworkX find largest fully connected subset of points


I am new to graph theory and am trying to find the largest subset of x,y coordinate points in which all points in the subset are within a specified distance of one another. I am currently using the nx.k_core(G) function in the following code:

import networkx as nx
import geopandas as gpd
import numpy as np

# Initialize graph
G = nx.Graph()

# Create nodes
for idx, point in enumerate(gdf.geometry):
    G.add_node(idx, pos=(point.x, point.y))

# Make connections based on 3-mile threshold
threshold = 3 * 5280
for i in range(len(gdf)):
    for j in range(i + 1, len(gdf)):
        point1 = gdf.geometry.iloc[i]
        point2 = gdf.geometry.iloc[j]
        distance = point1.distance(point2)
        if distance < threshold:
            G.add_edge(i, j, weight=distance)

# Find subset of fully connected points
core_g = nx.k_core(G)
gdf_subset = gdf.iloc[np.array(core_g.nodes())]

# Check distances
distances = gdf_subset .geometry.apply(lambda geom: gdf_subset .distance(geom))

distances = [x for x in distances.to_numpy().flatten() if x >threshold]
print(distances)

Edit: The following is an example of the gdf where the function does work to find the subset:

x = [7014374.9196544 , 7004008.53036228, 7004042.49746631,
    7019567.77240716, 7018905.07998047, 7000518.63143169,
    7001300.93303323, 6998409.97803232, 6998506.74716872,
    6998447.48736545]
y = [1153370.50279499, 1165484.07580641, 1163334.84053259,
     1165224.27028039, 1165504.92948242, 1159125.54067565,
     1158846.3225193 , 1160039.91516453, 1157673.01842574,
     1157635.65641456]

geometry = gpd.points_from_xy(x=x, y=y)
gdf = gpd.GeoDataFrame(geometry=geometry,crs="EPSG:2227")

But using a much larger dataset of x,y points, it does not properly filter the points:

x=[6976758.463773819, 6977093.184559382, 6976797.36470676, 6972610.464597752, 6974361.340182712, 6982474.642210263, 6986062.801006353, 6970849.624019324, 6968166.810391851, 6966362.750706404, 6969184.980095293, 6972262.814811845, 6966844.204922337, 6966888.478964512, 6967244.422170041, 6969614.686502094, 6990356.158245409, 6985244.4455984095, 6987939.80134037, 6981684.035262955, 6981620.091957918, 6977194.833151057, 6973647.5580193, 6972401.219414424, 6972181.1010827515, 6977545.751431958, 6977276.477270048, 6977279.180730501, 6982665.23935931, 6983265.274714659, 6987455.12297157, 6988951.103273708, 6988651.907310768, 6992239.454862614, 6988124.517805978, 6988280.244704331, 6987825.826193168, 6987825.826193168, 6977950.128964132, 6972212.611515398, 6983125.573658108, 6983462.728727658, 6975674.048946035, 6977768.29780711]

y=[1167471.315568185, 1165071.4983622595, 1164848.4903694615, 1164786.6823113142, 1167763.77605829, 1165261.2273441725, 1165351.8601757414, 1162465.3963376773, 1161806.8777149902, 1162436.5792886976, 1153514.1650730975, 1147692.7897634686, 1149836.5013905086, 1146776.4322807193, 1147000.2090415196, 1144556.914447512, 1158457.9077438777, 1160129.0363665363, 1159951.3434887768, 1158144.228039553, 1162406.3500516259, 1158222.9603871382, 1155292.0210851224, 1152540.9439513667, 1153266.4518821521, 1152726.215971334, 1152722.2165656704, 1152540.0718662178, 1152584.123681064, 1152483.8434239987, 1152474.3879562814, 1152497.1834639476, 1152492.6179683174, 1152729.7582206137, 1147820.598499059, 1147422.158551845, 1147779.6127970135, 1147779.6127970135, 1147667.4658725595, 1151117.1373719363, 1141878.4505136542, 1139369.3168056472, 1139653.9060171435, 1139757.820359667]
geometry = gpd.points_from_xy(x=x, y=y)
gdf = gpd.GeoDataFrame(geometry=geometry,crs="EPSG:2227")

What I need is the largest subset of points in which each point is connected to all others in the subset (via the 3-mile threshold). Is there a better approach to achieve this? Thanks!


Solution

  • The K-core is not "the largest subset of x,y coordinate points in which all points in the subset are within a specified distance of one another". It is the maximal subgraph that contains nodes of degree k or more. In networkx's implementation of k_core, the default k=None will yield the subgraph with the largest core number.

    You rather want to find a clique, specifically, the max_clique:

    nodes = nx.approximation.max_clique(G)
    # {0, 1, 2, 3, 4, 5, 6, 7, 17, 19, 20, 21, 25, 26, 27}
    

    Checking the distances:

    nodes = nx.approximation.max_clique(G)
    gdf_subset = gdf.iloc[list(nodes)]
    
    # Check distances
    distances = gdf_subset .geometry.apply(lambda geom: gdf_subset .distance(geom))
    
    distances = [x for x in distances.to_numpy().flatten() if x >threshold]
    print(distances) # []