python pandas numpy knn

Applying KNN Clustering based on user id


Dataset file: Google Drive link

Hello community, I need help applying KNN clustering to this use case.

I have a dataset consisting of 27,884 rows and 8,933 columns.

Here's a small preview of the dataset:

user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11
1 1 7 2 3 8 0 4 0 6 0 5
2 7 8 1 2 4 6 5 9 10 3 0
3 0 0 0 0 1 5 2 3 4 0 6
4 1 7 2 3 8 0 5 6 0 4
5 0 4 7 0 6 1 5 3 0 0 2
6 1 0 2 3 0 5 4 0 0 6 7

Here the user_iD column identifies students, and columns b1-b11 represent book chapters. Each value gives the position of that chapter in the student's study sequence (1 = studied first, 2 = studied second, and so on). A 0 means the student did not study that chapter.

This is just a small preview of a much larger dataset: there are 27,884 users and 8,932 chapters in total (b1 to b8932).

I need to find similar patterns among students and therefore want to apply KNN clustering. How do I do that?


Solution

  • Since you don't have class labels in your data, I'm guessing you want K-Means, which clusters unlabeled data, rather than KNN, which is a supervised classifier that needs labels. Here's a simple K-Means example. If for some reason you actually do want KNN for classification, please elaborate on the classification labels and I will try to assist.

    from sklearn.cluster import KMeans
    import numpy as np
    import pandas as pd
    
    df = pd.read_feather('Bundles.ftr')
    
    # It's common to split your data into train and test groups. See
    # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html for more info.
    # Here we simply take the first 500 rows; .copy() avoids a
    # SettingWithCopyWarning when the labels column is added below.
    df_train = df.head(500).copy()
    
    # build the feature matrix from every column except the first
    # (assumed to be the user id)
    X = df_train.iloc[:, 1:].to_numpy()
    # fit the data, tweak params as needed
    kmeans = KMeans(n_clusters=10, random_state=0).fit(X)
    
    # assign cluster labels to df
    df_train['labels'] = kmeans.labels_
    

    Next, let's look at how many rows fall into each cluster.

    df_train['labels'].value_counts()
    

    From this cluster distribution, we can see that the clusters are highly unbalanced: cluster 1 alone holds 415 of the 500 rows.

    1    415
    5     57
    7      9
    3      5
    0      4
    6      3
    2      3
    9      2
    8      1
    4      1
    Name: labels, dtype: int64
    

    To predict which cluster other rows belong to, call kmeans.predict on their feature values. The code below tells us that the row at index 999 is predicted to belong to cluster 1.

    # select row 999 as a 1-row frame, drop the first column, and predict its cluster
    kmeans.predict(df.iloc[[999], 1:].to_numpy())
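
The n_clusters=10 above is an arbitrary starting point ("tweak params as needed"). One common way to compare candidate values is the silhouette score; the sketch below is only an illustration of that idea, not part of the original answer. It uses a small synthetic matrix in place of the real dataset, and the helper name score_cluster_counts is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real feature matrix (assumption): 300 "users",
# 20 "chapters", generated around three well-separated centres.
rng = np.random.default_rng(0)
centres = rng.integers(0, 10, size=(3, 20)).astype(float)
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 20)) for c in centres])


def score_cluster_counts(X, candidates, random_state=0):
    """Return {k: silhouette score} for each candidate number of clusters."""
    scores = {}
    for k in candidates:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # silhouette_score ranges from -1 to 1; higher means tighter,
        # better-separated clusters
        scores[k] = silhouette_score(X, labels)
    return scores


scores = score_cluster_counts(X, range(2, 7))
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

On the real data you would pass the X built from df_train instead of the synthetic matrix. Computing the silhouette score on all 27,884 rows can be slow, so scoring a sample of rows first is a reasonable compromise.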