I am importing data from several Excel files on a daily basis. The records have no identifiers, and the data is then stored in a relational database (Oracle).
The problem is that the text may differ slightly from source to source, and because there's no unique identifier I need to somehow base my comparison on the text values.
For example, say I get these titles from different sources:
Source A: The Dark Knight
Source B: Batman The Dark Knight
Source C: The Dark Knight 2008
Source D: The Dark Knight Rises
If the database already holds an item with item_name "The Dark Knight", then importing the lines from sources A, B and C should give a "Full Match", but the line from D should not, because that's a different movie.
Things to know:
How do I go about solving this without inflating the database with tons of synonyms for each item?
Update 05/21/2013
I have found this: http://matpalm.com/resemblance/
It uses the Jaccard coefficient, although I'm not sure it's the best fit for my case because of the complexity: it means m x n comparisons, where m is the number of imported records and n is the total number of database records, which could be in the tens of thousands.
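To make this concrete, here is a minimal sketch of how I understand the Jaccard coefficient would apply to my example. It assumes the titles are compared as sets of lowercase words split on whitespace (the linked page may tokenize differently, e.g. with character shingles), and the helper names `tokens` and `jaccard` are just my own for illustration.

```python
def tokens(title):
    """Lowercase a title and split it on whitespace into a set of words."""
    return set(title.lower().split())


def jaccard(a, b):
    """Jaccard coefficient of two titles: |A & B| / |A | B| on their word sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        # Two empty titles are treated as identical (an assumption of this sketch).
        return 1.0
    return len(ta & tb) / len(ta | tb)


# The item already stored in the database.
stored = "The Dark Knight"

# The incoming titles from the example above.
incoming = {
    "A": "The Dark Knight",
    "B": "Batman The Dark Knight",
    "C": "The Dark Knight 2008",
    "D": "The Dark Knight Rises",
}

scores = {source: jaccard(stored, title) for source, title in incoming.items()}

for source, score in scores.items():
    print(source, score)
```

With this word-level tokenization, source A scores 1.0, while B, C and D all score 0.75, since each adds exactly one extra word to the three shared ones. So a plain threshold on this score would not separate D from B and C, on top of the m x n cost of computing it for every pair.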