I have data from students who took a test that has 2 sections : the 1st section tests their digital skill at level2, and the second section tests their digital skills at level3. I need to come up with 3 clusters of students depending on their scores to place them in 3 different skills levels (1,2 and 3) --> code sample below
import pandas as pd
data = [12,24,14,20,8,10,5,23]
# initialize data of lists.
data = {'Name': ['Marc','Fay', 'Emile','bastian', 'Karine','kathia', 'John','moni'],
'Scores_section1': [12,24,14,20,8,10,5,23],
'Scores_section2' : [20,4,1,0,18,9,12,10],
'Sum_all_scores': [32,28,15,20,26,19,17,33]}
# Create DataFrame
df = pd.DataFrame(data)
# print dataframe.
df
I thought about using K-means clustering, but following a tutorial online, I'd need to use x,y coordinates. Should I use scores_section1 as x, and Scores_section2 as y or vice-versa, and why?
Many thanks in advance for your help!
Try it this way.
import pandas as pd
data = [12,24,14,20,8,10,5,23]
# initialize data of lists.
data = {'Name': ['Marc','Fay', 'Emile','bastian', 'Karine','kathia', 'John','moni'],
'Scores_section1': [12,24,14,20,8,10,5,23],
'Scores_section2' : [20,4,1,0,18,9,12,10],
'Sum_all_scores': [32,28,15,20,26,19,17,33]}
# Create DataFrame
df = pd.DataFrame(data)
# print dataframe.
df
#Import required module
from sklearn.cluster import KMeans
#Initialize the class object
kmeans = KMeans(n_clusters=3)
#predict the labels of clusters.
df = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
label = kmeans.fit_predict(df)
label
df['kmeans'] = label
df
# K-Means Clustering may be the most widely known clustering algorithm and involves assigning examples to
# clusters in an effort to minimize the variance within each cluster.
# The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets
# on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably
# efficient in the sense of within-class variance.
# plot X & Y coordinates and color by cluster number
import plotly.express as px
fig = px.scatter(df, x="Scores_section1", y="Scores_section2", color="kmeans", size='Sum_all_scores', hover_data=['kmeans'])
fig.show()
Feel free to modify the code to suit your needs.