python  python-3.x  cluster-analysis  k-means

How to make KMeans Clustering more Meaningful for Titanic Data?


I'm running this code.

import pandas as pd
titanic = pd.read_csv('titanic.csv')
titanic.head()


# Import required modules
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = titanic['Name']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# initialize kmeans with 20 centroids
kmeans = KMeans(n_clusters=20, random_state=42)
# fit the model
kmeans.fit(X)
# store cluster labels in a variable
clusters = kmeans.labels_
titanic['kmeans'] = clusters
titanic.tail()

Finally...

from sklearn.decomposition import PCA

documents = titanic['Name']

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)
# pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(X.toarray())

# save our two dimensions into x0 and x1
x0 = pca_vecs[:, 0]
x1 = pca_vecs[:, 1]

# assign clusters and pca vectors to our dataframe 
titanic['cluster'] = clusters
titanic['x0'] = x0
titanic['x1'] = x1

titanic.head()

import plotly.express as px

fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', text='Name')
fig.show()

Here is the plot that I see.

[Image: scatter plot of x0 vs. x1, colored by cluster, with passenger names as text labels]

I guess it's working, but my question is: how can we make the text more dispersed and/or remove outliers so the chart is more meaningful? I'm guessing that the clustering is correct, because I'm not doing anything special here, but is there some way to make the clustering more significant or meaningful?

Data is sourced from here.

https://www.kaggle.com/competitions/titanic/data?select=test.csv


Solution

  • You could show each name only when the mouse hovers over its data point. Right now the code draws every passenger's name next to its point, and because many points sit close together, the names end up printed on top of each other. Passing the names through hover_name instead of text fixes this:

    fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
    fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                      title_font_size=30)
    fig.show()
    

    The only change is the parameter that carries the 'Name' information: hover_name instead of text. Here's how it looks after this change:

    [Image: new plot]

    Now, the names are only shown when you hover your mouse over the data point.

    Complete code

    Here's your complete code with that change applied:

    # Import required modules
    import pandas as pd
    import plotly.express as px
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Where our data is located on our machine
    train_data_filepath = '/Users/erikingwersen/Downloads/train.csv'
    test_data_filepath = '/Users/erikingwersen/Downloads/test.csv'
    
    # Read the train data from downloaded file
    titanic = pd.read_csv(train_data_filepath)
    
    documents = titanic['Name']
    
    X = TfidfVectorizer(stop_words='english').fit_transform(documents)
    
    # Initialize kmeans with 20 centroids
    kmeans = KMeans(n_clusters=20, random_state=42)
    
    # Fit the model
    kmeans.fit(X)
    
    # Store cluster labels in a variable
    clusters = kmeans.labels_
    titanic['kmeans'] = clusters

    # X is reused below; the names do not need to be vectorized a second time
    
    # Initialize PCA with 2 components
    pca = PCA(n_components=2, random_state=42)
    
    # Pass our X to the pca and store the reduced vectors into pca_vecs
    pca_vecs = pca.fit_transform(X.toarray())
    
    # Save our two dimensions into x0 and x1
    x0, x1 = pca_vecs[:, 0], pca_vecs[:, 1]
    
    # Assign clusters and pca vectors to our dataframe
    titanic['cluster'] = clusters
    titanic['x0'] = x0
    titanic['x1'] = x1
    
    
    titanic.head()
    
    fig = px.scatter(titanic, x='x0', y='x1', color='kmeans', hover_name='Name')
    fig.update_layout(title_text="KMeans Clustering of Titanic Passengers",
                      title_font_size=30)
    fig.show()