I have a directed graph (actually it is a hypergraph, but its ok to ignore that for the moment).
From this graph I pick various subgraphs and I'm looking for a function that ranks various such subsets by their 'cluster quality'.
'cluster quality' should be high when lots of links exist between the members of the subset
'cluster quality' should be low when lots of links exist from many members of the subset to or from outside the subset.
My question is:
What is the correct term for 'cluster quality'.?
What are the relevant algorithms / functions that do exist in this context?
What implementations exist on the JVM. Scala preferred, but anything callable from java is fine?
Background: The idea is to extract words from source code (class & method names or pieces thereof) and find those that might describe the application best by finding those that are used by 'good clusters' thus possibly representing some knowledge concept in the code.
In regards to algorithms/functions that are relevant to cluster analysis, there are several. Clustering in graphs is closely related to graph partitioning, which has recently been an active field of study, especially with the emergence of online social networks like Facebook and Twitter whose underlying structure is naturally represented by a (social) graph.
That being said, in my experience, two measures of clustering come to mind. One is modularity, which basically compares the sub-graph (cluster) to what a sub-graph would look like if the edges were distributed randomly.
Another is conductance, which measures how fast a random walk on the cluster candidate will converge to some uniform distribution.
Another, more loose measure, is looking at clustering coefficient, which measures the number of triangles (3-cycles) in a graph versus the number of possible triangles that could exist.
All in all, there's lots of algorithms (and academic papers) pertaining to this topic, the three I mention above are more general use cases.
Regarding an implementation on the JVM, there are no libraries I am aware of that come with such algorithms as part of it, but popular graph libraries for Scala are Graph for Scala (to be incorporated into the Scala Extended Core Library in the future) and Cassovary, released by Twitter.