Search code examples
social-networkingdiscrete-mathematics

Naming clusters in the graph


We have social graph that is later broken to clusters of high cohesion. Something called Truss by Jonathan Cohen [1].

Now that I have those clusters, I would like to come up with names for them. Cluster name should allow insignificant changes to the cluster size without changing the name.

For example: Let's assume we have cluster M:

M : {A, B, C, D, E, F}

and let's assume that "naming algorithm" generated name " m " for it.

After some time, vertex A has left the cluster, while vertex J has joined:

M : {B, C, D, E, F, J}

Newly generated name is " m' ".

Desired feature:

m' == m     for insignificant cluster changes

[1] http://www.cslu.ogi.edu/~zak/cs506-pslc/trusses.pdf


Solution

  • Based on your example, I assume you mean "insignificant changes to the cluster composition", not to the "cluster size".

    If your naming function f() cannot use the information about the existing name for the given cluster, you would have to allow that sometimes it does rename despite the change being small. Indeed, suppose that f() never renames a cluster when it changes just a little. Starting with cluster A, you can get to any other cluster B by adding or removing only one element at a time. By construction, the function will return the same name for A and B. Since A, B were arbitrary, f() will return the same name for all possible clusters - clearly useless.

    So, you have two alternatives:

    (1) the naming function relies on the existing name of a cluster, or

    (2) the naming function sometimes (rarely) renames a cluster after a very tiny change.

    If you go with alternative (1), it's trivial. You can simply assign names randomly, and then keep them unchanged whenever the cluster is updated as long as it's not too different (however you define different). Given how simple it is, I suppose that's not what you want.

    If you go with alternative (2), you'll need to use some information about the underlying objects in the cluster. If all you have are links to various objects with no internal structure, it can't be done, since the function wouldn't have anything to work with apart from cluster size.

    So let's say you have some information about the objects. For example, you may have their names. Call the first k letters of each object's name the object's prefix. Count all the different prefixes in your cluster, and find the n most common ones. Order these n prefixes alphabetically, and append them to each other in that order. For a reasonable choice of k, n (which should depend on the number of your clusters and typical object name lengths), you would get the result you seek - as long as you have enough objects in each cluster.

    For instance, if objects have human names, try k = 2; and if you have hundreds of clusters, perhaps try n = 2.

    This of course, can be greatly improved by remapping names to achieve a more uniform distribution, handling the cases where two prefixes have similar frequencies, etc.