Applying KNN Clustering based on user id

Hello Community, I need help applying KNN clustering to this use case.

I have a dataset with 27884 rows and 8933 columns.

Here's a small preview of the dataset:

user_iD	b1	b2	b3	b4	b5	b6	b7	b8	b9	b10	b11
1	1	7	2	3	8	0	4	0	6	0	5
2	7	8	1	2	4	6	5	9	10	3	0
3	0	0	0	0	1	5	2	3	4	0	6
4	1	7	2	3	8	0	5		6	0	4
5	0	4	7	0	6	1	5	3	0	0	2
6	1	0	2	3	0	5	4	0	0	6	7

Here the user_iD column identifies a student, and columns b1-b11 represent book chapters. Each value is the position of that chapter in the student's study sequence: 1 for the chapter studied first, 2 for the second, and so on. A 0 means the student did not study that chapter.
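To make the encoding concrete, here is a minimal sketch that turns one row into the ordered list of chapters the student studied. The helper name `study_order` is hypothetical and only for illustration; the values are those of user 1 in the preview above.

```python
def study_order(row):
    """Return chapter names in the order they were studied, skipping 0 (not studied)."""
    studied = {chapter: rank for chapter, rank in row.items() if rank != 0}
    # Sort chapter names by their rank (1 = studied first).
    return sorted(studied, key=studied.get)


# Row for user_iD 1, copied from the preview table.
user_1 = {"b1": 1, "b2": 7, "b3": 2, "b4": 3, "b5": 8, "b6": 0,
          "b7": 4, "b8": 0, "b9": 6, "b10": 0, "b11": 5}

order = study_order(user_1)
print(order)
```

Chapters with a 0 (b6, b8 and b10 for this user) do not appear in the result.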

This is just a small preview of a much larger dataset. In total there are 27884 users and 8932 chapters (b1 to b8932).

I need to find students with similar patterns, so I think I need to apply KNN clustering. How do I do that?

Solution

Since you don't have class labels in your data, I'm guessing you may want K-Means to cluster your data, rather than KNN. Here's a simple K-Means example. If for some reason, you actually do want KNN for classification, please elaborate on classification labels and I will try to assist.

from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

df = pd.read_feather('Bundles.ftr')

# It's common to split your data into train and test groups. See
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html for more info.
# Here we simply take the first 500 rows; .copy() avoids a SettingWithCopyWarning
# when the labels column is added below.
df_train = df.head(500).copy()

# put all of the feature columns (everything after the first column) into a NumPy array
X = df_train.iloc[:, 1:].to_numpy()
# fit the data, tweak params as needed
kmeans = KMeans(n_clusters=10, random_state=0).fit(X)

# assign cluster labels to df
df_train['labels'] = kmeans.labels_

Next, let's look at how many rows are in each cluster.

df_train['labels'].value_counts()

The resulting cluster distribution shows that the clusters are very unbalanced: cluster 1 contains 415 of the 500 rows.

1    415
5     57
7      9
3      5
0      4
6      3
2      3
9      2
8      1
4      1
Name: labels, dtype: int64
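When the clusters come out this lopsided, it can help to try several values of n_clusters and compare them. The sketch below is one possible approach, not the only one: it scores each candidate k with scikit-learn's silhouette_score. The data is a synthetic stand-in generated with make_blobs (an assumption so that the example runs on its own); with the real data you would pass your own X instead of X_demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: 300 points around 3 well-separated centers.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Silhouette is in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X_demo, labels)
    print(k, np.bincount(labels), round(scores[k], 3))

# Pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

On a large, high-dimensional dataset the silhouette computation can be slow; silhouette_score accepts a sample_size argument to score a random subset instead.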

If you want to predict which cluster other rows might belong to, use kmeans.predict. The code below tells us that the row at index 999 is predicted to belong to cluster 1.

kmeans.predict([df.iloc[999:1000, 1:].values.flatten().tolist()])
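Separately, if what you are really after is "which students are most similar to a given student" (the nearest-neighbors idea behind KNN) and you have no class labels, scikit-learn's unsupervised NearestNeighbors may be closer to what you want. This is only a sketch under assumptions: X_demo holds the six preview rows from the question, and the blank b8 entry for user 4 is filled with 0 purely so the example runs. How to treat missing values in the real data is a decision you would need to make yourself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# The six preview rows (columns b1..b11). Assumption: the blank b8 value
# for user 4 is replaced by 0 so the array is fully numeric.
X_demo = np.array([
    [1, 7, 2, 3, 8, 0, 4, 0, 6, 0, 5],
    [7, 8, 1, 2, 4, 6, 5, 9, 10, 3, 0],
    [0, 0, 0, 0, 1, 5, 2, 3, 4, 0, 6],
    [1, 7, 2, 3, 8, 0, 5, 0, 6, 0, 4],
    [0, 4, 7, 0, 6, 1, 5, 3, 0, 0, 2],
    [1, 0, 2, 3, 0, 5, 4, 0, 0, 6, 7],
])
user_ids = np.array([1, 2, 3, 4, 5, 6])

# Ask for 2 neighbors: the query row is part of the fitted data, so its
# nearest neighbor is itself at distance 0.
nn = NearestNeighbors(n_neighbors=2).fit(X_demo)
distances, indices = nn.kneighbors(X_demo[[0]])  # query with user 1's row

# indices[0][0] is the query row itself; indices[0][1] is the closest other row.
closest_user = user_ids[indices[0][1]]
print(closest_user, distances[0][1])
```

The default metric is Euclidean distance on the raw rank values; whether that is a meaningful notion of similarity for study sequences is something to check against your own data.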
