I have a pandas dataframe like this, where each ID is an observation with variables attr1, attr2 and attr3:
ID attr1 attr2 attr3
20 2 1 2
10 1 3 1
5 2 2 4
7 1 2 1
16 1 2 3
28 1 1 3
35 1 1 1
40 1 2 3
46 1 2 3
21 3 1 3
From this I made a similarity matrix (strictly a distance matrix, since 0 means identical) that I want to use, in which the IDs are compared by the sum of their absolute pairwise attribute differences:
[[ 0. 4. 3. 3. 3. 2. 2. 3. 3. 2.]
[ 4. 0. 5. 1. 3. 4. 2. 3. 3. 6.]
[ 3. 5. 0. 4. 2. 3. 5. 2. 2. 3.]
[ 3. 1. 4. 0. 2. 3. 1. 2. 2. 5.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 4. 3. 3. 1. 0. 2. 1. 1. 2.]
[ 2. 2. 5. 1. 3. 2. 0. 3. 3. 4.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 3. 3. 2. 2. 0. 1. 3. 0. 0. 3.]
[ 2. 6. 3. 5. 3. 2. 4. 3. 3. 0.]]
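For reference, a matrix like this can be reproduced from the table above. This is only a minimal sketch: it assumes SciPy is available, and the names `df`, `features` and `dist` are illustrative rather than taken from the original code.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# The example table from above; `df` is an illustrative name.
df = pd.DataFrame(
    {
        "ID": [20, 10, 5, 7, 16, 28, 35, 40, 46, 21],
        "attr1": [2, 1, 2, 1, 1, 1, 1, 1, 1, 3],
        "attr2": [1, 3, 2, 2, 2, 1, 1, 2, 2, 1],
        "attr3": [2, 1, 4, 1, 3, 3, 1, 3, 3, 3],
    }
)

# Only the attributes go into the distance; the ID column is left out.
features = df[["attr1", "attr2", "attr3"]].to_numpy(dtype=float)

# "cityblock" (Manhattan) distance is the sum of absolute attribute differences.
# pdist returns the condensed form; squareform expands it to an n x n matrix.
dist = squareform(pdist(features, metric="cityblock"))

print(dist)
```

Row i and column i of `dist` both correspond to row i of `df`, so the ID of any entry can be looked up by position.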
I tried DBSCAN from sklearn to cluster the data, but it seems that only the clusters themselves are labeled? I want to be able to look up the ID of each data point in the visualization later. So I want to cluster only on the differences between the IDs, not on the ID values themselves. Is there another algorithm better suited to this kind of data, or a way to label the distance matrix values so that it can be used with DBSCAN or another method?

PS: The dataset has over 50 attributes and 10,000 observations.
The labels_ attribute gives you an array with one cluster label for each data point you fitted on, in the same order as the input. The first element of that array is the label of your first training data point, and so on, so you can match the labels back to your IDs by position.
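To illustrate how labels_ lines up with the input rows, here is a minimal sketch using the example data from the question. The `eps` and `min_samples` values are illustrative assumptions, not tuned recommendations, and the variable names (`df`, `dist`, `result`) are made up for this example.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Example data from the question; `df` is an illustrative name.
df = pd.DataFrame(
    {
        "ID": [20, 10, 5, 7, 16, 28, 35, 40, 46, 21],
        "attr1": [2, 1, 2, 1, 1, 1, 1, 1, 1, 3],
        "attr2": [1, 3, 2, 2, 2, 1, 1, 2, 2, 1],
        "attr3": [2, 1, 4, 1, 3, 3, 1, 3, 3, 3],
    }
)

# Distance matrix over the attributes only; the ID column is not a feature.
features = df[["attr1", "attr2", "attr3"]].to_numpy(dtype=float)
dist = pairwise_distances(features, metric="manhattan")

# metric="precomputed" tells DBSCAN that the input is already a distance matrix.
# eps and min_samples are assumptions chosen for this small example.
model = DBSCAN(eps=1.5, min_samples=2, metric="precomputed").fit(dist)

# labels_[i] belongs to row i of the matrix, which is row i of df,
# so the labels can be attached to the IDs directly by position.
result = pd.DataFrame({"ID": df["ID"], "cluster": model.labels_})
print(result)
```

A label of -1 means DBSCAN treated that point as noise. Since the IDs never enter the distance computation, they act only as a lookup key for the visualization afterwards.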