Tags: python, machine-learning, cluster-analysis, unsupervised-learning, feature-clustering

Unsupervised Clustering of large multi-dimensional data


Hello, I am a machine learning newbie. I need some help with unsupervised clustering of high-dimensional data. I have data with over 15 dimensions and around 50-80 thousand rows. The data looks something like this (15 participants with an almost equal number of rows each and 15 features):

Participant   time   feature 1   feature 2   ...
1             0.05   val         val
1             0.10   val         val
2             0.05   val         val
2             0.10   val         val
2             0.15   val         val

The data consists of many participants; each participant has multiple rows of data, time-stamped, with their features. My goal is to cluster this data by participant and make inferences based on these clusters. The problem is that there are many rows for each participant and I cannot represent each participant with a single point, so clustering them seems like a difficult task.

I would like help with:

  1. What would be the best way to cluster this data so that I can make inferences per participant?

  2. Which clustering technique should I use? I have tried sklearn's KMeans, MeanShift, and other libraries, but they take too long and crash my system.

Sorry if this is a bit difficult to understand; I will try my best to answer your questions. Thank you in advance for the help. If this question is very similar to some other question, please let me know (I was not able to find it).

Thank you :)


Solution

  • Since you are short on the necessary amount of compute, you will have to make some sort of compromise here. Here are a few suggestions that will likely fix your problem, but they all come at a cost.

    1. Dimensionality reduction, e.g. PCA, to reduce your number of columns to ~2 or so. You will lose some information, but you will be able to plot the result and do inference via K-means.

    2. Average each patient's data. Not sure if this will be enough; it depends on your data. You will lose the over-time observations of your patients, but this will likely drastically reduce your number of rows.

    My suggestion is to do dimensionality reduction, since losing the over-time data on your patients might render your data useless. There are also other options besides PCA, for example autoencoders. For clustering the way you describe, I'd recommend you stick to K-means or soft K-means. A minimal sketch combining both suggestions is shown below.
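
    Here is a minimal sketch of that pipeline, assuming the column layout from the question (a `Participant` column, a `time` column, and the feature columns) and a made-up file name (`data.csv`); the number of clusters and the scaling choice are placeholders you would need to tune:

    ```python
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # One row per time stamp, as in the question:
    # columns: Participant, time, feature 1, feature 2, ...
    df = pd.read_csv("data.csv")  # assumed file name

    feature_cols = [c for c in df.columns if c not in ("Participant", "time")]

    # Suggestion 2: collapse each participant to a single point
    # by averaging their features over time.
    per_participant = df.groupby("Participant")[feature_cols].mean()

    # Suggestion 1: scale, then reduce to 2 dimensions with PCA
    # so the result can be plotted.
    X = StandardScaler().fit_transform(per_participant)
    X_2d = PCA(n_components=2).fit_transform(X)

    # K-means on the reduced representation; n_clusters=3 is a
    # guess and should be tuned (e.g. with the elbow method).
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
    per_participant["cluster"] = labels
    print(per_participant["cluster"])
    ```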
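
    For soft K-means, scikit-learn does not ship one as such, but a Gaussian mixture model gives per-cluster membership probabilities and can stand in for it; a short sketch under the same assumptions as above:

    ```python
    from sklearn.mixture import GaussianMixture

    # Soft clustering: each participant gets a probability of
    # belonging to each cluster instead of a hard label.
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X_2d)
    probs = gmm.predict_proba(X_2d)  # shape: (n_participants, 3)
    ```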