Tags: python, pandas, cluster-analysis, k-means

Clustering using Python


I have data that resembles this:

import pandas as pd
import random
random.seed(901)
rand_list1= []
rand_list2= []
rand_list3= []
rand_list4= []
rand_list5= []

for i in range(20):
    x = random.randint(80,1000)
    rand_list1.append(x/100)
    
    y1 = random.randint(-200,200)
    rand_list2.append(y1/10)
    y2 = random.randint(-200,200)
    rand_list3.append(y2/10)
    y3 = random.randint(-200,200)
    rand_list4.append(y3/10)
    y4 = random.randint(-200,200)
    rand_list5.append(y4/10)

df = pd.DataFrame({'Rainfall Recorded':rand_list1, 'TAXI A':rand_list2, 'TAXI B':rand_list3, 'TAXI C':rand_list4, 'TAXI D':rand_list5})

df.head()


   Rainfall Recorded  TAXI A  TAXI B  TAXI C  TAXI D
0               5.21    13.7    -5.0   -14.2     9.8
1               2.39    -0.3    18.8     4.8    -6.4
2               8.09    15.0    -3.6    18.6    12.7
3               5.79    -0.2    14.6     0.9     3.8
4               7.48    10.9     9.0    15.4   -16.5

Given the rainfall recorded in our region (in centimeters), these are the percentage changes in earnings reported by the taxi drivers surveyed. Can I use k-means clustering to determine whether the taxis operated in our locality or not? Assume there is a relationship between the rainfall recorded and the change in earnings.

I have some simple code from a web source:

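from sklearn.cluster import KMeans

# df has no 'TAXI' column; 'TAXI A' is used here as an example
km = KMeans(n_clusters=2)
y_predicted = km.fit_predict(df[['TAXI A', 'Rainfall Recorded']])
y_predicted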

But I am unsure what transformations need to be done before using this code.


Solution

    import pandas as pd
    from sklearn.cluster import KMeans
    import numpy as np
    import random
    
    random.seed(901)
    rand_list1 = []
    rand_list2 = []
    rand_list3 = []
    rand_list4 = []
    rand_list5 = []
    
    for i in range(20):
        x = random.randint(80, 1000)
        rand_list1.append(x / 100)
        
        y1 = random.randint(-200, 200)
        rand_list2.append(y1 / 10)
        y2 = random.randint(-200, 200)
        rand_list3.append(y2 / 10)
        y3 = random.randint(-200, 200)
        rand_list4.append(y3 / 10)
        y4 = random.randint(-200, 200)
        rand_list5.append(y4 / 10)
    
    df = pd.DataFrame({
        'Rainfall Recorded': rand_list1, 
        'TAXI A': rand_list2, 
        'TAXI B': rand_list3, 
        'TAXI C': rand_list4, 
        'TAXI D': rand_list5
    })
    
    # Number of clusters
    k = 2
    
    # Function to apply k-means to each row
    def cluster_row(row, n_clusters):
        # Extract the taxi data
        taxi_data = row[['TAXI A', 'TAXI B', 'TAXI C', 'TAXI D']].values.reshape(-1, 1)
        kmeans = KMeans(n_clusters=n_clusters, random_state=0)
        kmeans.fit(taxi_data)
        return kmeans.labels_
    
    # Apply the function to each row and store the cluster labels
    df['Taxi Clusters'] = df.apply(lambda row: cluster_row(row, k), axis=1)
    
    print(df)
    

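    This adds a Taxi Clusters column to df. For each row (one rainfall reading), k-means splits the four taxi values into two groups, and the column holds the resulting labels in the order TAXI A, TAXI B, TAXI C, TAXI D. The labels 0 and 1 are assigned independently for each row, so they are not comparable across rows, and the rainfall value itself is not used in the clustering.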
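
If the aim is closer to the snippet in the question, that is, clustering on rainfall and earnings change together, one possible preparation is to reshape the table so that each (rainfall, taxi) pair becomes its own row and to standardize both columns before fitting. The sketch below assumes that reading of the problem. The column names Taxi, Earnings Change and Cluster are illustrative and do not come from the original post. Whether the two clusters correspond to taxis that did or did not operate locally is not something k-means can establish by itself.

```python
import random

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rebuild the example data from the question (same seed, same draw order).
random.seed(901)
rows = []
for _ in range(20):
    rainfall = random.randint(80, 1000) / 100
    taxis = [random.randint(-200, 200) / 10 for _ in range(4)]
    rows.append([rainfall, *taxis])

taxi_cols = ['TAXI A', 'TAXI B', 'TAXI C', 'TAXI D']
df = pd.DataFrame(rows, columns=['Rainfall Recorded', *taxi_cols])

# Reshape to long format: one row per (rainfall reading, taxi) pair.
long_df = df.melt(
    id_vars='Rainfall Recorded',
    value_vars=taxi_cols,
    var_name='Taxi',             # illustrative column name
    value_name='Earnings Change',  # illustrative column name
)

# Standardize both features so neither dominates the Euclidean distance.
features = StandardScaler().fit_transform(
    long_df[['Rainfall Recorded', 'Earnings Change']]
)

# n_init and random_state are set explicitly for reproducible results.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
long_df['Cluster'] = km.fit_predict(features)

print(long_df.head())
print(long_df['Cluster'].value_counts())
```

Standardizing matters here because rainfall ranges over roughly 1 to 10 while the earnings change ranges over -20 to 20, so without scaling the earnings column would largely decide the clusters.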