python pandas k-means sklearn-pandas

K-Means classification by group


I'm trying to run a K-means analysis on a DataFrame like this:

    URBAN AREA  PROVINCE    DENSITY
0   1          TRUJILLO     0.30
1   2          TRUJILLO     0.03
2   3          TRUJILLO     0.80
3   1          LIMA         1.20
4   2          LIMA         0.04
5   1          LAMBAYEQUE   0.90
6   2          LAMBAYEQUE   0.10
7   3          LAMBAYEQUE   0.08

(You can download it from here)

As you can see, the df lists urban areas, each with its own urban density value, inside provinces. I want to do the K-means classification on a single column: DENSITY. To do so, I run this code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df=pd.read_csv('C:/Path/to/example.csv')

clustering=KMeans(n_clusters=2, max_iter=300)
clustering.fit(df[['DENSITY']])

df['KMeans_Clusters']=clustering.labels_
df

And I get this result, which is OK for this first part of the example:

    URBAN AREA  PROVINCE    DENSITY     KMeans_Clusters
0   1           TRUJILLO       0.30     0
1   2           TRUJILLO       0.03     0
2   3           TRUJILLO       0.80     1
3   1           LIMA           1.20     1
4   2           LIMA           0.04     0
5   1           LAMBAYEQUE     0.90     1
6   2           LAMBAYEQUE     0.10     0
7   3           LAMBAYEQUE     0.08     0

But now I want to do the K-means classification of urban areas by province, that is, to repeat the same process within each province. So I tried this code:

df=pd.read_csv('C:/Users/rojas/Desktop/example.csv')

clustering=KMeans(n_clusters=2, max_iter=300)

clustering.fit(df[['DENSITY']]).groupby('PROVINCE')

df['KMeans_Clusters']=clustering.labels_
df

but I get this message:

AttributeError                            Traceback (most recent call last)
<ipython-input-4-87e7696ff61a> in <module>
      3 clustering=KMeans(n_clusters=2, max_iter=300)
      4 
----> 5 clustering.fit(df[['DENSITY']]).groupby('PROVINCE')
      6 
      7 df['KMeans_Clusters']=clustering.labels_

AttributeError: 'KMeans' object has no attribute 'groupby'

Is there a way to do so?


Solution

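  • KMeans.fit returns the fitted KMeans object, not a DataFrame, which is why groupby is not available on it. Group by province first, then fit one model per group: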

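    from sklearn.cluster import KMeans

    def k_means(group):
        # group is the sub-DataFrame holding the rows of one province
        group = group.copy()
        clustering = KMeans(n_clusters=2, max_iter=300)
        model = clustering.fit(group[['DENSITY']])
        group['KMeans_Clusters'] = model.labels_
        return group

    # group_keys=False keeps the original row index in the result
    df = df.groupby('PROVINCE', group_keys=False).apply(k_means)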
    

    Result (the 0/1 numbering is arbitrary within each province):

       URBAN  AREA    PROVINCE  DENSITY  KMeans_Clusters
    0      0     1    TRUJILLO     0.30                0
    1      1     2    TRUJILLO     0.03                0
    2      2     3    TRUJILLO     0.80                1
    3      3     1        LIMA     1.20                1
    4      4     2        LIMA     0.04                0
    5      5     1  LAMBAYEQUE     0.90                0
    6      6     2  LAMBAYEQUE     0.10                1
    7      7     3  LAMBAYEQUE     0.08                1
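
A possible variation on the approach above, offered as a sketch rather than the definitive way to do it. It uses groupby(...)['DENSITY'].transform so that one label per row comes back aligned to the original index, and it renumbers the clusters by sorted center so that 0 always means the lower-density cluster within a province. The helper name cluster_density, the n_init=10 and random_state=0 settings, and the inline copy of the question's sample data are assumptions made for illustration. Each province needs at least n_clusters rows, otherwise KMeans raises an error.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Sample data copied from the question (inline so the sketch is self-contained)
df = pd.DataFrame({
    "URBAN AREA": [1, 2, 3, 1, 2, 1, 2, 3],
    "PROVINCE": ["TRUJILLO"] * 3 + ["LIMA"] * 2 + ["LAMBAYEQUE"] * 3,
    "DENSITY": [0.30, 0.03, 0.80, 1.20, 0.04, 0.90, 0.10, 0.08],
})


def cluster_density(values, n_clusters=2):
    """Hypothetical helper: cluster the densities of one province.

    Labels are renumbered so that 0 is the cluster with the lowest center.
    """
    # n_init and random_state are illustrative choices, not from the original answer
    model = KMeans(n_clusters=n_clusters, n_init=10, max_iter=300, random_state=0)
    # KMeans expects a 2-D array: one row per sample, one column per feature
    raw_labels = model.fit_predict(values.to_numpy().reshape(-1, 1))
    # rank[i] is the position of raw cluster i when the centers are sorted ascending
    rank = np.argsort(np.argsort(model.cluster_centers_.ravel()))
    return rank[raw_labels]


# transform calls the helper once per province and keeps the original row order
df["KMeans_Clusters"] = df.groupby("PROVINCE")["DENSITY"].transform(cluster_density)
print(df)
```

With the renumbering step, the densest urban area in each province gets label 1 and the least dense gets label 0. The labels are still relative to each province, so a 1 in LIMA and a 1 in LAMBAYEQUE do not imply similar densities. Using transform on a single column also sidesteps the differences between pandas versions in how groupby.apply treats group keys and the grouping column.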