email search cluster-analysis hierarchical-clustering

How to cluster mail subject lines into mail threads?

If I only have the subject lines of mails (no other headers), is there a good algorithm (or package) to cluster them into sets of "related messages"?

A mail with the subject

Our travel plans

is probably related to

Re: Our travel plans and
Re: Re: Our travel plans.

So far so good, but there is also

AW: Our travel plans
Fwd: Our travel plans
Our travel plans (Forward)

I want to cluster all of them together into one thread. Mails with subjects like plans, Re: Our meeting and so on should not be in that thread, of course. I could very well live with a hierarchical result. Actually, I kind of like that, because I'd expect mails with similar content to end up "closer" to each other.

So, I have a lot of ideas: suffix matching, prefix trees, Levenshtein distances, q-gram profiles, maybe too many. Therefore I ask myself: "Has anyone done this already?"

Solution

For sequence comparisons, I use OpenRefine (formerly Google Refine) to try out clustering algorithms, identify the one that fits the data, and fine-tune it. It includes key collision methods (fingerprint, n-gram fingerprint and double-metaphone) and nearest neighbor methods (Levenshtein distance and prediction by partial matching (PPM)).
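To give a feel for what key collision does with subject lines, here is a minimal Python sketch. It is not OpenRefine's code, only an approximation of the fingerprint idea (lowercase, drop punctuation, sort and deduplicate tokens). The reply/forward marker lists in PREFIX and SUFFIX and the function names are my own assumptions, not part of OpenRefine; extend the lists for other mail clients and languages.

```python
import re
import string
import unicodedata
from collections import defaultdict

# Assumed reply/forward markers (English and German); not an exhaustive list.
PREFIX = re.compile(r"^\s*(?:re|aw|fwd?|wg)\s*:\s*", re.IGNORECASE)
SUFFIX = re.compile(r"\s*\((?:forward|fwd)\)\s*$", re.IGNORECASE)


def strip_markers(subject):
    """Repeatedly remove leading 'Re:'-style markers and a trailing '(Forward)'."""
    previous = None
    while previous != subject:
        previous = subject
        subject = PREFIX.sub("", subject)
        subject = SUFFIX.sub("", subject)
    return subject.strip()


def fingerprint(text):
    """Approximate a fingerprint key: ASCII-fold, lowercase, drop punctuation,
    then sort and deduplicate the whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", text.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster_by_key(subjects):
    """Group subjects whose stripped fingerprints collide."""
    clusters = defaultdict(list)
    for subject in subjects:
        clusters[fingerprint(strip_markers(subject))].append(subject)
    return dict(clusters)
```

Calling cluster_by_key on the subjects from the question puts all six "Our travel plans" variants under one key, while plans and Re: Our meeting each get their own. Without the marker stripping, a plain fingerprint would keep "Re:" as a token and split the thread, so that step matters more than the fingerprint itself.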

https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions

Once you have your data imported, just use facets to do your clustering.

Facet > Text facet > Cluster
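If you later want to reproduce the nearest neighbor idea outside OpenRefine, the following is a rough sketch under my own assumptions: a textbook Levenshtein distance plus a simple single-link grouping with a fixed threshold. The function names and the max_distance default of 4 are hypothetical choices for illustration, and OpenRefine's own implementation works differently internally (it uses blocking to avoid comparing every pair).

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_distance(subjects, max_distance=4):
    """Single-link grouping: a subject joins (and merges) every cluster that
    contains at least one member within max_distance of it."""
    clusters = []
    for subject in subjects:
        hits, rest = [], []
        for cluster in clusters:
            near = any(levenshtein(subject.lower(), member.lower()) <= max_distance
                       for member in cluster)
            (hits if near else rest).append(cluster)
        merged = [member for cluster in hits for member in cluster] + [subject]
        clusters = rest + [merged]
    return clusters
```

With a threshold of 4 this links Our travel plans, Re: Our travel plans, AW: Our travel plans and Fwd: Our travel plans, but not Our travel plans (Forward), because the suffix alone adds 10 characters. Raising the threshold far enough to catch it would start merging unrelated subjects, which is why it is worth trying the methods on real data first.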