Dataset file: Google Drive link
I have a dataset of shape (27884 rows, 8933 columns).
Here's a small preview of the dataset:
user_iD | b1 | b2 | b3 | b4 | b5 | b6 | b7 | b8 | b9 | b10 | b11 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 7 | 2 | 3 | 8 | 0 | 4 | 0 | 6 | 0 | 5 |
2 | 7 | 8 | 1 | 2 | 4 | 6 | 5 | 9 | 10 | 3 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 5 | 2 | 3 | 4 | 0 | 6 |
4 | 1 | 7 | 2 | 3 | 8 | 0 | 5 | 0 | 6 | 0 | 4 |
5 | 0 | 4 | 7 | 0 | 6 | 1 | 5 | 3 | 0 | 0 | 2 |
6 | 1 | 0 | 2 | 3 | 0 | 5 | 4 | 0 | 0 | 6 | 7 |
Here the column user_iD represents students, and the columns b1-b11 represent book chapters. Each entry gives the order in which that student studied the chapters: first, second, third, and so on. A 0 entry means the student did not study that particular chapter.
This is just a small preview of a much bigger dataset. There are a total of 27884 users and 8932 chapters, stated as (b1--b8932).
I'm applying PCA, and I'm getting this error: ValueError: Found array with 0 feature(s) (shape=(22307, 0)) while a minimum of 1 is required.
As I stated, there are 27884 users and 8932 other columns.
What I have tried so far
import pandas as pd

df3 = pd.read_feather('Bundles.ftr')
X = df3['user_iD']
y = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train= X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
How do I apply PCA to this use case?
Here's how to use PCA to pre-process your data:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

df3 = pd.read_feather('Bundles.ftr')
# Use every column except the user ID as the feature matrix
X = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X into the
# Training set and Testing set
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
# Convert the DataFrames to NumPy arrays
X_train = X_train.values
X_test = X_test.values
# Applying PCA function on training
# and testing set of X component
print(X_train.shape)
print(X_test.shape)
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
This is what the X_train variable looks like after preprocessing:
array([[-1846.8651992 , 437.17734222],
[-1847.05838019, 437.41158726],
[-1845.67443438, 436.28046074],
...,
[-1847.00651974, 437.20374889],
[ -780.18296423, 116.65908052],
[-1847.09404683, 437.30545959]])
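If you want to check how much of the variance the two components actually retain, you can inspect explained_variance_ratio_. Here is a minimal sketch on random stand-in data (the real dataset isn't available here, so X_demo is my own placeholder for the chapter columns):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the chapter columns: 100 users, 11 chapters, order codes 0-11
X_demo = rng.integers(0, 12, size=(100, 11)).astype(float)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_demo)

print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)        # fraction of variance per component
print(pca.explained_variance_ratio_.sum())  # total variance retained
```

If the summed ratio is very low, two components are probably discarding most of the signal and you may want a larger n_components.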
However, I don't think that PCA is the right tool here, for a few reasons:
For one, I think kNN would be easier to interpret.
More importantly, the way the input features are encoded mixes ordinal and categorical information, which will make your clustering algorithm work less well.
For example, if one user read a chapter first and another user didn't read that chapter at all, they are assigned 1 and 0 respectively. In this case, a higher number means the user is more interested.
In another case, if one user read a chapter seventh and another user read it eighth, they are assigned 7 and 8. In this case, a higher number means the user is less interested.
On top of that, this encoding says the difference between reading something seventh or eighth is the same as the difference between reading it first and not reading it at all. To me, if someone didn't read a chapter at all, that's a much bigger difference than a slight change in reading order.
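To make that concrete, here is a toy illustration of my own (not from the question) showing that Euclidean distance treats both cases identically:

```python
import numpy as np

# Codes for a single chapter under the current encoding:
seventh = np.array([7.0])     # read it seventh
eighth = np.array([8.0])      # read it eighth
read_first = np.array([1.0])  # read it first
not_read = np.array([0.0])    # never read it

# Distance between "read seventh" and "read eighth"
print(np.linalg.norm(seventh - eighth))      # 1.0

# Distance between "read first" and "not read at all"
print(np.linalg.norm(read_first - not_read)) # 1.0 -- same magnitude,
# even though skipping a chapter entirely is a far bigger behavioural difference
```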
So I would suggest having two sets of input features: did they read it at all, and if they did, where in their reading did the chapter fall.
The first set of features could be computed like this:
did_read = (X.values >= 1).astype(int)
These features are 1 if read and 0 otherwise.
The second set of features could be computed like this:
X_values = X.values
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)
order_normalized = X_values / max_order
These features are in the range [0, 1], indicating whether the chapter fell toward the beginning or the end of that user's reading order.
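The two feature sets can then be stacked side by side into one matrix before feeding it to PCA or a clustering algorithm. A sketch, using a small toy stand-in for X.values since the real dataset isn't available here:

```python
import numpy as np

# Toy stand-in for X.values: 3 users, 4 chapters, 0 = not read
X_values = np.array([
    [1, 3, 0, 2],
    [0, 0, 1, 2],
    [4, 1, 2, 3],
], dtype=float)

# First feature set: 1 if the chapter was read at all, else 0
did_read = (X_values >= 1).astype(int)

# Second feature set: reading position normalized per user to [0, 1]
# (initial=1 guards against division by zero for users who read nothing)
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)
order_normalized = X_values / max_order

# Stack both feature sets: shape (n_users, 2 * n_chapters)
features = np.hstack([did_read, order_normalized])
print(features.shape)  # (3, 8)
```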