python algorithm 3d cluster-analysis pattern-recognition

Clustering 3D data with one nominal scale


Problem Statement

I have 2D pandas DataFrames that hold data about each user's tool usage characteristics (e.g. 88% usage of System A, 11% usage of System B, and 1% usage of System C for a given user):

        A      B       C
Usage  0,88   0,11   0,01

Assume that three users (ID: 1, 2, 3) are present, with the following matrices:

ID:1    A      B       C     ID:2    A      B      C     ID:3    A      B    C
Usage  0,88   0,11   0,01    Usage  0,86   0,13   0,01   Usage  0,00  0,00  1,00

I thought of aggregating the single 2D matrices to a 3D matrix to identify clusters of similar usage behaviour.

Goal

Identify clusters within system usage. In this example, ID 1 and ID 2 should be clustered together. I built a working DBSCAN method for clustering random 2D data.

However, I face the problem that the 2D matrices are stacked in a fixed sequence within the aggregated 3D matrix. Similarity therefore cannot be identified by looking along that one fixed nominal sequence, because every user's 2D data must be compared to all other users' 2D data to find similar usage behaviour.

Thoughts

I thought of integrating something like the k-fold cross-validation method used for small data sets in machine learning. However, I don't know how to integrate such behaviour into a clustering algorithm.

Another thought is that pattern recognition or hierarchical clustering (although the total number of clusters is unknown) may be the better way to go, as the third axis of the aggregated 3D matrix is on a nominal scale (user ID). However, I am inexperienced in the domain of pattern recognition up to this point.

Maybe someone has a good methodic idea to solve this clustering problem. :)


Solution

  • The example uses labels_true only for evaluation, not as input to DBSCAN itself; the labels_true values come from the function that creates the mock dataset. The correct way to call DBSCAN is db = DBSCAN(eps=0.3, min_samples=10).fit(X), where X in your case is [[valueA, valueB, valueC], [valueA, valueB, valueC], ...], i.e. one row per user. The result is then in db.labels_.
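
As a rough sketch of the call above applied to the three users from the question: the per-user frames are stacked into a single 2D matrix with one row per user, so the user ID ends up as the row index rather than as a third axis. The names frames, usage and clusters are made up for illustration. eps=0.3 is taken from the call above, while min_samples=2 is an assumption chosen for this tiny example, since with only three users min_samples=10 would mark every user as noise.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

# One single-row DataFrame per user, using the values from the question.
# (Hypothetical container: a dict keyed by user ID.)
frames = {
    1: pd.DataFrame({"A": [0.88], "B": [0.11], "C": [0.01]}, index=["Usage"]),
    2: pd.DataFrame({"A": [0.86], "B": [0.13], "C": [0.01]}, index=["Usage"]),
    3: pd.DataFrame({"A": [0.00], "B": [0.00], "C": [1.00]}, index=["Usage"]),
}

# Stack the frames into one 2D table: one row per user, one column per system.
# The dict keys become the outer index level; dropping the inner "Usage" level
# leaves the user ID as a plain row label, not as a feature.
usage = pd.concat(frames).droplevel(1)

# X has shape (n_users, n_systems), i.e. [[valueA, valueB, valueC], ...].
X = usage.to_numpy()

# min_samples=2 is an assumption for this 3-user example (see the note above).
db = DBSCAN(eps=0.3, min_samples=2).fit(X)

# Map the cluster labels back to the user IDs; -1 means "noise" (no cluster).
clusters = pd.Series(db.labels_, index=usage.index, name="cluster")
print(clusters)
```

With these values, users 1 and 2 fall into the same cluster and user 3 is labelled -1 (noise). DBSCAN compares every row against every other row, so the order in which the users are stacked does not matter. For real data, eps and min_samples would need tuning.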