pattern-matching neural-network similarity fuzzy

Determining if two or more summaries are similar


The problem is as follows:

I have one summary, usually between 20 and 50 words, that I'd like to compare to other relatively similar summaries. The general category and the geographical location to which the summary refers are already known.

For instance, if people from the same area are writing about building a house, I'd like to be able to list those summaries with some level of certainty that they actually refer to building houses instead of building a garage or a backyard swimming pool.

The data set is currently around 50 000 documents with a growth rate of some 200 documents per day.

Preferred languages would be Python, PHP, C/C++, Haskell or Erlang, whichever might get the job done. Also, if you don't mind, I'd like to understand the reasoning for picking a specific language.


Solution

  • You could have a look at the WEBSOM project.

    Even though their website has not been updated recently, the problem it solves is very similar to yours. They were processing comparable amounts of data (and more) about 10 years ago, so today you could probably run the algorithms on almost any hardware, even a cell phone.
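
To give a rough feel for the idea behind WEBSOM, here is a minimal sketch of a self-organizing map trained on bag-of-words vectors, using only the Python standard library. It is not WEBSOM's code: the function names (`vectorize`, `best_matching_unit`, `train_som`), the 3x3 grid, the learning-rate and radius schedules, and the sample summaries are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def vectorize(texts):
    """Turn each text into a length-normalised bag-of-words vector."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def best_matching_unit(grid, vec):
    """Return (row, col) of the map cell whose weights are closest to vec."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            dist = sum((w - x) ** 2 for w, x in zip(weights, vec))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(vectors, rows=3, cols=3, epochs=200, seed=0):
    """Train a small self-organizing map (hypothetical default parameters)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Start from random weights, one weight vector per grid cell.
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)  # learning rate decays towards 0
        radius = 1.0 + (max(rows, cols) / 2.0) * (1.0 - frac)  # shrinks to 1
        for vec in rng.sample(vectors, len(vectors)):  # shuffled pass
            br, bc = best_matching_unit(grid, vec)
            for r in range(rows):
                for c in range(cols):
                    # Cells near the winner on the grid are pulled harder.
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-d2 / (2.0 * radius ** 2))
                    weights = grid[r][c]
                    for i in range(dim):
                        weights[i] += rate * influence * (vec[i] - weights[i])
    return grid


# Made-up sample summaries, for illustration only.
summaries = [
    "building a new house with a brick foundation",
    "we are building a house and pouring the foundation",
    "installing a backyard swimming pool with a pump",
    "digging a swimming pool in the backyard",
]
vocab, vectors = vectorize(summaries)
som = train_som(vectors)
cells = [best_matching_unit(som, v) for v in vectors]
for text, cell in zip(summaries, cells):
    print(cell, text)
```

Summaries that land in the same or neighbouring cells are candidates for being about the same thing. On real data you would probably want stop-word removal, a larger grid, and a numeric library rather than nested Python lists, but at 50 000 short documents even a plain implementation like this should be workable.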