If I only have the subject lines of mails (no other headers), is there a good algorithm (or package) to cluster them into sets of "related messages"?
A mail with the subject
Our travel plans
is probably related to
Re: Our travel plans
and
Re: Re: Our travel plans
So far so good, but there is also
AW: Our travel plans
Fwd: Our travel plans
Our travel plans (Forward)
I want to cluster all of them together into one thread. Mails with subjects like "plans", "Re: Our meeting" and so on should not be in that thread, of course. I could very well live with a hierarchical result -- actually, I kind of like that, because I'd expect mails with similar content to end up "closer" to each other.
So, I have a lot of ideas: suffix matching, prefix trees, Levenshtein distances, q-gram profiles -- maybe too many. Therefore I ask myself: "Has anyone done this already?"
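As a baseline for the prefix/suffix ideas, here is a minimal Python sketch that strips reply and forward markers and groups subjects by what is left. The marker lists in `PREFIX` and `SUFFIX` and the function names are assumptions made up for this illustration, not taken from any package, and they are certainly not exhaustive.

```python
import re
from collections import defaultdict

# Assumed, non-exhaustive reply/forward markers (English and German).
PREFIX = re.compile(r"^\s*(?:re|aw|fwd?|wg)\s*:\s*", re.IGNORECASE)
SUFFIX = re.compile(r"\s*\((?:forward|fwd)\)\s*$", re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading/trailing markers until nothing changes; return a lowercase key."""
    previous = None
    current = subject.strip()
    while current != previous:
        previous = current
        current = PREFIX.sub("", current)
        current = SUFFIX.sub("", current)
    return current.strip().lower()


def group_by_subject(subjects):
    """Map each normalized key to the original subjects that share it."""
    threads = defaultdict(list)
    for subject in subjects:
        threads[normalize_subject(subject)].append(subject)
    return dict(threads)


subjects = [
    "Our travel plans",
    "Re: Our travel plans",
    "Re: Re: Our travel plans",
    "AW: Our travel plans",
    "Fwd: Our travel plans",
    "Our travel plans (Forward)",
    "plans",
    "Re: Our meeting",
]
threads = group_by_subject(subjects)
```

This only handles exact matches after stripping, so it gives flat threads rather than a hierarchy; a distance measure would still be needed for subjects that were edited by hand.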
For sequence comparisons, I use OpenRefine (formerly Google Refine) to try out clustering algorithms, identify which one to use and fine-tune it. It includes key collision methods (fingerprint, n-gram and double-metaphone) and nearest neighbor methods (Levenshtein distance and prediction by partial matching (PPM)).
https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions
Once you have your data imported, just use facets to do your clustering.
Facet > Text facet > Cluster
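To get a feel for what key collision does outside the tool, here is a rough Python sketch of a fingerprint-style key. It only approximates the idea (lowercase, drop punctuation, fold to ASCII, sort and de-duplicate tokens) and is not OpenRefine's actual code. Dropping the tokens in `MARKERS` is my own assumption for mail subjects and not something the fingerprint method does by itself.

```python
import string
import unicodedata
from collections import defaultdict

# Assumed reply/forward tokens to ignore; not part of the fingerprint method itself.
MARKERS = {"re", "aw", "fw", "fwd", "wg", "forward"}


def fingerprint(text):
    """Build an order-insensitive key from the words of a subject line."""
    text = unicodedata.normalize("NFKD", text.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # fold accents to ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = set(text.split()) - MARKERS
    return " ".join(sorted(tokens))


def cluster_by_key(subjects):
    """Subjects whose keys collide land in the same cluster."""
    clusters = defaultdict(list)
    for subject in subjects:
        clusters[fingerprint(subject)].append(subject)
    return dict(clusters)


clusters = cluster_by_key([
    "Our travel plans",
    "Re: Re: Our travel plans",
    "AW: Our travel plans",
    "Our travel plans (Forward)",
    "plans",
    "Re: Our meeting",
])
```

Without the `MARKERS` step, "Re: Our travel plans" would get the key "our plans re travel" and would not collide with the original subject, so some marker handling is needed before or during keying.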