python-3.x cluster-analysis

How to perform clustering on this list of data?


My knowledge of clustering analysis and data science is superficial. My problem is to group the following data into clusters:

Data = [40,45,50,60]

My criterion is to group numbers together when the difference between every pair of numbers in a group is within a certain threshold (let's say 10). So the possible clusters are:

Cluster1 = [40,45] [50,60]
Cluster2 = [40,45,50] [60]
Cluster3 = [40][45,50][60]

I need to find all such possible clusters and select one of them based on a certain condition. Is there any data science library which I can use to perform such clustering?


Solution

  • Since your data is one-dimensional, the problem is much easier than the usual multivariate clustering scenario.

    You can use a very simple strategy to enumerate all possible "clusterings":

    1. Sort your data
    2. Begin with the smallest value
    3. If the next value is within the threshold of the smallest value in the current cluster, add it to the cluster and continue
    4. Backtrack, and this time do not add the value to the existing cluster; begin a new cluster with it instead.

    Stop looking for a library for everything, and just code this yourself. Clustering libraries solve more complicated problems and will usually not include such simple univariate strategies.
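
The steps above could be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the function name `enumerate_clusterings` and its parameters are made up for illustration, and it assumes that "within the threshold" means the largest and smallest value of a cluster differ by at most the threshold (which, for sorted data, is the same as every pair in the cluster being within the threshold).

```python
def enumerate_clusterings(data, threshold):
    """Yield every way to split the sorted data into contiguous clusters
    whose spread (max - min) is at most `threshold`.

    Assumption: "within the threshold" is inclusive (difference <= threshold).
    """
    values = sorted(data)  # step 1: sort the data

    def extend(index, clusters):
        if index == len(values):
            # Every value has been placed; emit a copy of this clustering.
            yield [list(cluster) for cluster in clusters]
            return

        value = values[index]

        # Step 3: add the value to the current cluster if it still fits.
        # The data is sorted, so clusters[-1][0] is the cluster's minimum.
        if clusters and value - clusters[-1][0] <= threshold:
            clusters[-1].append(value)
            yield from extend(index + 1, clusters)
            clusters[-1].pop()  # undo the choice (backtrack)

        # Step 4: alternatively, start a new cluster with this value.
        clusters.append([value])
        yield from extend(index + 1, clusters)
        clusters.pop()  # undo the choice (backtrack)

    # Step 2: begin with the smallest value and no clusters yet.
    yield from extend(0, [])


if __name__ == "__main__":
    for clustering in enumerate_clusterings([40, 45, 50, 60], threshold=10):
        print(clustering)
```

For the data in the question, the three clusterings listed there are among the results, along with the other valid splits such as putting every number in its own cluster. Picking one "based on a certain condition" is then an ordinary filter or `min`/`max` over the generated list, for example `min(results, key=len)` to prefer the fewest clusters.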