python pandas algorithm cluster-analysis

Clustering the data based on multiple attributes (in python)


I have a dataset describing various cities, with a set of attributes for each city. There are roughly 10,000 cities and about 10 attributes; some take values between 0 and 1, while others are counts (for example, how many parks or hospitals there are in a given city).

I want to cluster this data, so cities that have similar values of attributes (OVERALL) are clustered together.

I was thinking of using the k-means algorithm, but I am not sure whether that is the best option for my problem and dataset, given that hierarchical clustering and spatial clustering techniques also exist.

Would someone have a recommendation on which algorithm would be best in this particular case, and how to do it quickly in Python?

First 10 rows of the 10,000 row dataset look like this (more or less):

import pandas as pd

columns = ['city_id', 'attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5', 'attribute6', 'attribute7', 'attribute8', 'attribute9']
data = [[0, 20, 45, 0.15, 0.04, 12, 1, 2, 10, 0.02],
        [1, 12, 35, 0.12, 0.03, 10, 0, 4, 5, 0.04],
        [2, 14, 28, 0.09, 0.01, 8, 1, 5, 4, 0.05],
        [3, 5, 17, 0.08, 0.02, 6, 1, 10, 3, 0.01],
        [4, 35, 36, 0.04, 0.02, 5, 1, 3, 15, 0.035],
        [5, 2, 12, 0.13, 0.04, 7, 0, 4, 13, 0.044],
        [6, 23, 52, 0.19, 0.04, 14, 0, 5, 9, 0.057],
        [7, 42, 29, 0.04, 0.05, 9, 1, 2, 7, 0.024],
        [8, 9, 34, 0.21, 0.07, 10, 1, 6, 15, 0.017],
        [9, 4, 41, 0.22, 0.03, 2, 0, 8, 11, 0.065]]
        
df = pd.DataFrame(data, columns=columns)
df

Solution

  • This actually sounds like a prototypical case for k-means clustering. Make sure to normalize the values of the different attributes beforehand, though, so that each attribute carries equal weight in the clustering. Otherwise, attributes with large numeric ranges dominate the distance computation.

    You could do something like this:

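    from sklearn import preprocessing
    from sklearn.cluster import KMeans

    # Drop the identifier column so it does not influence the distances
    x = df.drop(columns=['city_id']).values
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)  # rescale each attribute to [0, 1]
    kcluster = KMeans(n_clusters=2).fit(x_scaled)  # for 2 clusters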
    

    You can then access the clustering with the object returned by KMeans.fit, e.g.

    kcluster.labels_
    

    This returns an array with the cluster label assigned to each row (city).
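
To tie the steps together, here is a minimal end-to-end sketch using the ten sample rows from the question. It assumes that `city_id` is only an identifier and should be left out of the features, and it fixes `n_init` and `random_state` purely to make the run repeatable. The `cluster` column name and the choice of two clusters are illustrative, not something the question specifies.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Sample rows copied from the question
columns = ['city_id', 'attribute1', 'attribute2', 'attribute3', 'attribute4',
           'attribute5', 'attribute6', 'attribute7', 'attribute8', 'attribute9']
data = [[0, 20, 45, 0.15, 0.04, 12, 1, 2, 10, 0.02],
        [1, 12, 35, 0.12, 0.03, 10, 0, 4, 5, 0.04],
        [2, 14, 28, 0.09, 0.01, 8, 1, 5, 4, 0.05],
        [3, 5, 17, 0.08, 0.02, 6, 1, 10, 3, 0.01],
        [4, 35, 36, 0.04, 0.02, 5, 1, 3, 15, 0.035],
        [5, 2, 12, 0.13, 0.04, 7, 0, 4, 13, 0.044],
        [6, 23, 52, 0.19, 0.04, 14, 0, 5, 9, 0.057],
        [7, 42, 29, 0.04, 0.05, 9, 1, 2, 7, 0.024],
        [8, 9, 34, 0.21, 0.07, 10, 1, 6, 15, 0.017],
        [9, 4, 41, 0.22, 0.03, 2, 0, 8, 11, 0.065]]
df = pd.DataFrame(data, columns=columns)

# Assumption: city_id is an identifier, not a feature, so it is excluded
features = df.drop(columns=['city_id'])

# Rescale every attribute to the range [0, 1]
x_scaled = MinMaxScaler().fit_transform(features)

# n_clusters=2 is an arbitrary illustrative choice;
# n_init and random_state are set only for repeatability
kcluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x_scaled)

# Attach the labels to the original rows ('cluster' is a made-up column name)
df['cluster'] = kcluster.labels_

# Per-cluster attribute means, in the original (unscaled) units
cluster_means = df.drop(columns=['city_id']).groupby('cluster').mean()

print(df[['city_id', 'cluster']])
print(cluster_means)
```

Writing the labels back into the DataFrame keeps each city next to its cluster, and the per-cluster means give a quick way to see which attributes separate the groups.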