Dataset file: Google Drive link
Hello community, I need help applying KNN clustering to this use case.
I have a dataset with 27884 rows and 8933 columns.
Here's a small preview of the dataset:
user_iD | b1 | b2 | b3 | b4 | b5 | b6 | b7 | b8 | b9 | b10 | b11 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 7 | 2 | 3 | 8 | 0 | 4 | 0 | 6 | 0 | 5 |
2 | 7 | 8 | 1 | 2 | 4 | 6 | 5 | 9 | 10 | 3 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 5 | 2 | 3 | 4 | 0 | 6 |
4 | 1 | 7 | 2 | 3 | 8 | 0 | 5 | 6 | 0 | 4 | |
5 | 0 | 4 | 7 | 0 | 6 | 1 | 5 | 3 | 0 | 0 | 2 |
6 | 1 | 0 | 2 | 3 | 0 | 5 | 4 | 0 | 0 | 6 | 7 |
Here the column user_iD identifies the student, and columns b1-b11 represent book chapters. Each value is the position of that chapter in the order the student studied it (1 = studied first, 2 = second, 3 = third, and so on). A 0 entry means the student did not study that chapter.
This is just a small preview of a much larger dataset. In total there are 27884 users and 8932 chapters, named b1 to b8932.
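To make the encoding concrete, here is a minimal sketch that turns the first three preview rows back into an ordered list of chapters per student. The `study_order` helper and the `preview` frame are hypothetical names used only for illustration. They are not part of the original dataset or code.

```python
import pandas as pd

# First three rows of the preview table (columns b1..b11).
preview = pd.DataFrame(
    {
        "user_iD": [1, 2, 3],
        "b1": [1, 7, 0], "b2": [7, 8, 0], "b3": [2, 1, 0], "b4": [3, 2, 0],
        "b5": [8, 4, 1], "b6": [0, 6, 5], "b7": [4, 5, 2], "b8": [0, 9, 3],
        "b9": [6, 10, 4], "b10": [0, 3, 0], "b11": [5, 0, 6],
    }
)


def study_order(row):
    """Return chapter names sorted by study position, skipping 0 (not studied)."""
    ranks = row.drop("user_iD")
    studied = ranks[ranks > 0]
    return studied.sort_values().index.tolist()


# Map each student id to the chapters in the order they were studied.
orders = {int(row["user_iD"]): study_order(row) for _, row in preview.iterrows()}

print(orders[1])  # ['b1', 'b3', 'b4', 'b7', 'b11', 'b9', 'b2', 'b5']
```

Student 1 studied eight chapters, starting with b1 and ending with b5, and skipped b6, b8 and b10.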
I need to find students with similar patterns, so I think I need to apply KNN clustering. How do I do that?
Since you don't have class labels in your data, I'm guessing you want K-Means to cluster your data rather than KNN. KNN (k-nearest neighbors) is a supervised classifier that needs labeled examples, while K-Means groups unlabeled rows into clusters. Here's a simple K-Means example. If you actually do want KNN for classification, please elaborate on the classification labels and I will try to assist.
from sklearn.cluster import KMeans
import pandas as pd

df = pd.read_feather('Bundles.ftr')

# It's common to split your data into train and test groups. See
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html for more info.
# Here we simply take the first 500 rows; .copy() avoids a possible
# SettingWithCopyWarning when the labels column is added below.
df_train = df.head(500).copy()

# put all of the feature columns (everything after the user id) into a NumPy array
X = df_train.iloc[:, 1:].to_numpy()

# fit the data, tweak params as needed
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)

# assign cluster labels to df
df_train['labels'] = kmeans.labels_
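The code above takes the first 500 rows as a quick sample. If you would rather use a random split, as the linked `train_test_split` page describes, it could look roughly like the sketch below. The `demo` frame is synthetic stand-in data with made-up sizes and values, and `n_clusters=4` is an arbitrary choice for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real DataFrame (hypothetical sizes and values).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(0, 12, size=(200, 11)),
    columns=[f"b{i}" for i in range(1, 12)],
)
demo.insert(0, "user_iD", np.arange(1, 201))

# Random 80/20 split instead of taking the first rows.
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=0)

# Fit on the training features only (all columns except user_iD).
model = KMeans(n_clusters=4, n_init=10, random_state=0)
model.fit(demo_train.iloc[:, 1:].to_numpy())

# Assign each held-out row to its nearest learned centroid.
test_labels = model.predict(demo_test.iloc[:, 1:].to_numpy())
```

With the real data you would pass `df` in place of `demo`.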
Next let's look at how many values are in each cluster.
df_train['labels'].value_counts()
The resulting cluster distribution is heavily unbalanced: cluster 1 holds 415 of the 500 rows.
1 415
5 57
7 9
3 5
0 4
6 3
2 3
9 2
8 1
4 1
Name: labels, dtype: int64
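With a distribution this lopsided, it may be worth comparing a few values of `n_clusters` rather than fixing it at 10. One common heuristic is the silhouette score. The sketch below runs it on synthetic blobs from `make_blobs`, not on the real dataset, so the numbers it produces say nothing about your data. The range of `k` values is also just an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the real feature matrix X.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# Fit K-Means for several cluster counts and record the silhouette score of each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Higher silhouette scores indicate tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the full dataset the silhouette computation can be slow, so you may want to run it on a sample first.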
You can also predict which cluster other rows belong to. The code below predicts that the row at index 999 belongs to cluster 1.
kmeans.predict(df.iloc[[999], 1:].to_numpy())
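If the goal is to look up the students most similar to one particular student, rather than to partition everyone into groups, an unsupervised nearest-neighbor search may be closer to what the question calls KNN. Below is a minimal sketch using scikit-learn's `NearestNeighbors` on five rows from the preview table (student 4 is left out because its last cell is blank). Using Euclidean distance on the raw values is an assumption: it treats 0 ("not studied") as just another number, which may or may not suit the real data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Students 1, 2, 3, 5 and 6 from the preview table (columns b1..b11).
user_ids = np.array([1, 2, 3, 5, 6])
features = np.array([
    [1, 7, 2, 3, 8, 0, 4, 0, 6, 0, 5],
    [7, 8, 1, 2, 4, 6, 5, 9, 10, 3, 0],
    [0, 0, 0, 0, 1, 5, 2, 3, 4, 0, 6],
    [0, 4, 7, 0, 6, 1, 5, 3, 0, 0, 2],
    [1, 0, 2, 3, 0, 5, 4, 0, 0, 6, 7],
])

# Index all rows, then query with student 1 (row 0).
nn = NearestNeighbors(n_neighbors=3).fit(features)
distances, indices = nn.kneighbors(features[[0]])

# The first hit is the query row itself (distance 0), so skip it.
similar_users = user_ids[indices[0][1:]]
print(similar_users)  # [5 3]
```

Here students 5 and 3 come out as the closest matches to student 1. A different distance metric could give a different ranking.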