Similar questions in the database seem to be much more complicated than my example. I want to cluster roughly 100 points on a line. The number of groups is irrelevant; what matters is how close the points are to one another.
What is a term, method or algorithm to deal with this grouping problem? K-means, Hamming distance, hierarchical agglomeration, clique or complete linkage?
I've reduced two examples to bare minimum for clarification:
Simple example:
Set A = {600, 610, 620, 630}, and the pairwise differences between its elements are diff_A = {10, 20, 30, 10, 20, 10}. I can then group these as {10, 10, 10}, {20, 20}, and {30}. Done.
Problematic example:
Set B = {600, 609, 619, 630}, and the pairwise differences are diff_B = {9, 10, 11, 19, 21, 30}. I try to group with a tolerance of 1, i.e. differences that are at most 1 apart are 'similar enough' to be grouped together, but I get a paradox: {9, 10} AND/OR {10, 11}, {19}, {21}, and {30}.
Issue:
9 and 10 are close enough, 10 and 11 are close enough, but 9 and 11 are not, so how should I handle these overlapping groups? Perhaps this small example is unsolvable because it is symmetrical?
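To make the ambiguity concrete, here is a small Python sketch (my own naive attempt, not a standard algorithm) that chains any sorted differences within the tolerance into one group; note how 9 and 11 end up together even though they differ by 2:

```python
# Naive single-linkage-style chaining: sort the differences, then start a
# new group whenever the gap to the previous value exceeds the tolerance.
def chain_group(values, tol=1):
    values = sorted(values)
    groups = [[values[0]]]
    for v in values[1:]:
        if v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups

diff_B = [9, 10, 11, 19, 21, 30]
print(chain_group(diff_B))  # [[9, 10, 11], [19], [21], [30]]
```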
Why do you work on the pairwise differences? Consider the values 1, 2, 101, 102, 201, 202. Their pairwise differences are 1, 100, 101, 200, 201, 99, 100, 199, 200, 1, 100, 101, 99, 100, 1.
The values of ~200 bear no information: there is a different "cluster" in between, so you shouldn't use them in your analysis.
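A quick sketch to verify this (plain Python, just to reproduce the numbers above):

```python
from itertools import combinations

values = [1, 2, 101, 102, 201, 202]
diffs = sorted(b - a for a, b in combinations(values, 2))
print(diffs)
# [1, 1, 1, 99, 99, 100, 100, 100, 100, 101, 101, 199, 200, 200, 201]
# Only the three 1s are true within-cluster gaps; all the other values
# mix distances between points in different clusters.
```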
Instead, grab a statistics textbook and look up Kernel Density Estimation. Don't bother looking for clustering methods - they are usually designed for the multivariate case. Your data is one-dimensional and can be sorted (it probably already is), and this can be exploited for better results.
There are well-established heuristics for density estimation on such data, and you can split your data at local density minima (or simply at a low density threshold). This is much simpler, yet robust and reliable. You don't need to set a parameter such as k for k-means. There are cases where k-means is a good choice - it has its origins in signal detection, where it was known that there are k=10 different signal frequencies - but today it is mostly used on multidimensional data.
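A minimal pure-Python sketch of the split-at-density-minima idea (my own illustration; the bandwidth and grid step are arbitrary assumptions here, and in practice you would use a library KDE with a bandwidth rule such as Silverman's):

```python
import bisect
import math

def kde(xs, grid, bandwidth):
    """Gaussian kernel density estimate of xs, evaluated at each grid point."""
    norm = len(xs) * bandwidth * math.sqrt(2 * math.pi)
    return [sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in xs) / norm
            for g in grid]

def split_at_density_minima(xs, bandwidth=2.0, step=0.5):
    """Split 1-d data into groups at local minima of the estimated density."""
    xs = sorted(xs)
    grid = [xs[0] + i * step for i in range(int((xs[-1] - xs[0]) / step) + 1)]
    dens = kde(xs, grid, bandwidth)
    # Cut points: interior grid positions where the density has a local minimum.
    cuts = [grid[i] for i in range(1, len(grid) - 1)
            if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]]
    groups, prev = [], 0
    for c in cuts:
        i = bisect.bisect_left(xs, c)
        if i > prev:
            groups.append(xs[prev:i])
            prev = i
    if prev < len(xs):
        groups.append(xs[prev:])
    return groups

print(split_at_density_minima([1, 2, 101, 102, 201, 202]))
# [[1, 2], [101, 102], [201, 202]]
```

Note that the groups fall out directly from the data's density: no k has to be chosen, and the pairwise differences are never computed.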