python cluster-analysis k-means

Create clusters based on test score performance


I have data from students who took a test with two sections: the first section tests their digital skills at level 2, and the second section tests their digital skills at level 3. I need to group the students into 3 clusters based on their scores, so that I can place them in 3 different skill levels (1, 2 and 3). Code sample below:

import pandas as pd

# Initialize the data as a dict of lists.
data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
        'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
        'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
        'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}
  
# Create DataFrame
df = pd.DataFrame(data)
  
# print dataframe.
df

I thought about using K-means clustering, but the online tutorial I followed uses x, y coordinates. Should I use Scores_section1 as x and Scores_section2 as y, or vice versa, and why?

Many thanks in advance for your help!


Solution

  • Try it this way.

    import pandas as pd
    
    # Initialize the data as a dict of lists.
    data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
            'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
            'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
            'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}
      
    # Create DataFrame
    df = pd.DataFrame(data)
      
    # print dataframe.
    df
    
    
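    # Import required module
    from sklearn.cluster import KMeans

    # Initialize the class object; random_state makes the labels reproducible
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

    # Fit on the numeric score columns only and predict a cluster label per student
    features = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
    label = kmeans.fit_predict(features)
    label


    # Store the labels on the full DataFrame so the Name column is kept
    df['kmeans'] = label
    df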
    
    
    # K-means is probably the most widely known clustering algorithm. It assigns
    # each example to one of k clusters so as to minimize the variance within each cluster.
    # In other words, it partitions an N-dimensional population into k sets on the
    # basis of a sample, and the resulting partitions are reasonably efficient in
    # the sense of within-class variance.
    
    # plot X & Y coordinates and color by cluster number
    import plotly.express as px
    fig = px.scatter(df, x="Scores_section1", y="Scores_section2", color="kmeans", size='Sum_all_scores', hover_data=['kmeans'])
    fig.show()
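
One thing to keep in mind: the numbers returned by fit_predict are arbitrary cluster ids (0, 1, 2), not skill levels. The sketch below shows one possible way to turn them into levels 1 to 3, by ranking the clusters on their mean total score. That ranking rule is my assumption, not something the question specifies, and the names `level_of_cluster` and `Level` are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({
    'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
    'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
    'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
    'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33],
})

# Same three feature columns as in the code above
features = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
df['kmeans'] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Assumption: a cluster with a higher mean total score is a higher skill level.
cluster_means = df.groupby('kmeans')['Sum_all_scores'].mean()

# Rank the clusters: 1 for the lowest mean, 3 for the highest
level_of_cluster = cluster_means.rank(method='first').astype(int)

# Translate each student's cluster id into a level
df['Level'] = df['kmeans'].map(level_of_cluster)
print(df[['Name', 'Sum_all_scores', 'kmeans', 'Level']])
```

Since the two sections test different levels, a different rule (for example, ranking on Scores_section2 only) may fit your placement logic better; only the `cluster_means` line would need to change.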
    


    Feel free to modify the code to suit your needs.
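
On the x/y part of the question: as far as K-means is concerned, it should not matter which section is x and which is y. The algorithm works on Euclidean distances, which add up the squared differences over all columns, so swapping the columns leaves every distance unchanged. The x/y choice only affects how the scatter plot is drawn. Here is a small sketch to check this on the data from the question; fixing `random_state` so that both runs start the same way is my choice, and `adjusted_rand_score` is just one way to compare two sets of labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

section1 = [12, 24, 14, 20, 8, 10, 5, 23]
section2 = [20, 4, 1, 0, 18, 9, 12, 10]

# Same students, columns in the two possible orders
xy = np.column_stack([section1, section2])  # section1 as x, section2 as y
yx = np.column_stack([section2, section1])  # section2 as x, section1 as y

labels_xy = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(xy)
labels_yx = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(yx)

# An adjusted Rand index of 1.0 means both runs group the students identically,
# even if the cluster ids themselves are numbered differently.
score = adjusted_rand_score(labels_xy, labels_yx)
print(score)
```

For the same reason, K-means is not limited to two columns: you can pass it any number of numeric features, which is what the code above does with three.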