
How to cluster mail subject lines to mail threads?


If I only have the subject lines of mails (no other headers) is there a good algorithm (or package) to cluster them into a set of "related messages"?

A mail with the subject

  • Our travel plans

is probably related to

  • Re: Our travel plans and
  • Re: Re: Our travel plans.

So far so good, but there is also

  • AW: Our travel plans
  • Fwd: Our travel plans
  • Our travel plans (Forward)

I want to cluster all of them together into one thread. Mails with subjects like plans, Re: Our meeting and so on should not be in that thread, of course. I could very well live with a hierarchical result -- actually, I would kind of like that, because I'd expect mails with similar content to end up "closer" to each other.

So, I have a lot of ideas: suffix matching, prefix trees, Levenshtein distances, q-gram profiles -- maybe too many. Therefore I ask myself: "Has anyone done this already?"
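As a baseline before trying any of these ideas, the reply and forward markers from the examples above can simply be stripped and the remaining text used as a thread key. The following is only a minimal sketch: the marker list (Re, AW, Fwd/Fw, WG, and a trailing "(Forward)" or "(Fwd)") is an assumption and would need extending for other mail clients and languages, and the names `normalize_subject` and `group_by_thread` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed marker lists; real mail clients use many more variants.
_PREFIX = re.compile(r"^\s*(?:re|aw|fwd?|wg)\s*:\s*", re.IGNORECASE)
_SUFFIX = re.compile(r"\s*\((?:forward|fwd)\)\s*$", re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading reply/forward prefixes and trailing forward tags."""
    previous = None
    current = subject.strip()
    # Repeat until nothing changes, so "Re: Re: ..." is fully unwrapped.
    while current != previous:
        previous = current
        current = _PREFIX.sub("", current)
        current = _SUFFIX.sub("", current)
    return current.strip().lower()


def group_by_thread(subjects):
    """Group the original subject lines by their normalized form."""
    threads = defaultdict(list)
    for subject in subjects:
        threads[normalize_subject(subject)].append(subject)
    return dict(threads)
```

With this, all six "Our travel plans" variants share one key, while plans and Re: Our meeting land in groups of their own. It only handles exact matches after stripping; typos or reworded subjects would still need one of the fuzzy measures.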


Solution

  • For sequence comparisons, I use OpenRefine (formerly Google Refine) to try out clustering algorithms, fine-tune them, and identify the one to use. It includes key collision methods (fingerprint, n-gram and double-metaphone) and nearest neighbor methods (Levenshtein distance and prediction by partial matching (PPM)).

    https://github.com/OpenRefine/OpenRefine/wiki/Installation-Instructions

    Once you have your data imported, just use facets to do your clustering.

    Facet > Text facet > Cluster
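To get a feel for what the two families of methods do outside the tool, here is a rough Python sketch. It is not OpenRefine's code: `fingerprint` only approximates the idea of a key collision method (lowercase, drop punctuation, sort the unique tokens), `levenshtein` is the textbook dynamic-programming edit distance that a nearest neighbor method could use, and `cluster_by_key` is a hypothetical helper name.

```python
import string
from collections import defaultdict


def fingerprint(text):
    """Approximate key-collision key: lowercase, no punctuation, sorted unique tokens."""
    cleaned = text.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def levenshtein(a, b):
    """Edit distance between a and b, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_by_key(values, key=fingerprint):
    """Return groups of values whose keys collide (groups of size > 1 only)."""
    clusters = defaultdict(list)
    for value in values:
        clusters[key(value)].append(value)
    return [group for group in clusters.values() if len(group) > 1]
```

Note that a marker such as Re: survives as an extra token in the fingerprint, so "Re: Our travel plans" would not collide with "Our travel plans" unless the markers are stripped first or a distance-based method with a suitable threshold is used.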