python scikit-learn k-means

How do you get the X value for KMeans clustering in Python?


I'm a complete beginner with KMeans. How do I decide what to pass as X? I have a dataframe with several rows and columns, and I don't know how to pick out one specific X value.

I can't just substitute the entire dataframe for X. For example:

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("cereal.csv")
kmeans = KMeans(n_clusters=4)
kmeans.fit(X) ## How do I get this X? 

Solution

  • X is the matrix of feature values from your dataframe (here, df): one row per sample and one column per feature.

    For example:

    from sklearn.cluster import KMeans
    
    # Convert every column to float (np.float was removed in NumPy 1.24; use the builtin float)
    X = df.values.astype(float)
    kmeans = KMeans(n_clusters=4).fit(X)
    

    To see the cluster label assigned to each row, read the labels_ attribute:

    predicted_values = kmeans.labels_
    


    Note:

    You may need to clean the data and drop some columns before passing it to KMeans. For example, an ID column, if you have one, should be removed because it says nothing about how similar two rows are.

    If any of your columns contain string values, they must be encoded numerically first. For example, you cannot pass values like high or low directly; encode them as 0 and 1 instead.
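
The cleaning steps in the note can be sketched roughly as follows. The contents of cereal.csv are not shown in the question, so the small dataframe and its column names (id, calories, sugars, fiber_level) are made up purely for illustration, as is the choice of two clusters. Fixing n_init and random_state is also just an assumption to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for cereal.csv; the real column names are unknown.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "calories": [70, 120, 110, 50, 130, 100],
    "sugars": [6, 8, 10, 0, 12, 5],
    "fiber_level": ["high", "low", "low", "high", "low", "high"],
})

# Drop the identifier column: it is not a feature.
features = df.drop(columns=["id"])

# Encode the string column as 0/1 so every column is numeric.
features["fiber_level"] = features["fiber_level"].map({"low": 0, "high": 1})

# X has one row per sample and one column per remaining feature.
X = features.to_numpy(dtype=float)

# Two clusters only because the toy data is tiny.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Attach each row's cluster label back onto the original dataframe.
df["cluster"] = kmeans.labels_
print(df)
```

Writing the labels back to df keeps the dropped id column next to each row's cluster, which makes the result easier to inspect.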