I have two spreadsheets that each give information about a bunch of applications running on my work's network. They were created by two separate people, who never ever seemed to correspond.
As a result, the names they have given the applications are not constant between sheets. They are, however, similar. For example, one might call an application "Office 2010", the other "MS Office 10" or something.
I have looked up the Levenshtein algorithm, but this seems to apply only to single words or phrases where the word order is constant, and only the spelling differs. (I am not a computer scientist; feel free to correct me on this).
Therefore I am looking for an algorithm that, for each name in one sheet, can cycle through every name in the other sheet and find the closest match. Doesn't have to be perfect, anything will help.
Any ideas? Thanks to all who can help out.
The Levenshtein Distance is a generalised form of edit distance, which counts the number of edits -- insertions, deletions, and substitutions -- it would take to transform one string to another. You're right that it doesn't handle transpositions very well, but depending on your needs it may still do the job.
Fuzzy string matching is a heuristic area, so the best thing to do is play around in order to try to meet your specific objectives. For example, you could try preprocessing the text by case-folding it and then sorting the tokens lexicographically prior to taking the edit distance, which would help with transpositions in many cases. You could also subtract the absolute difference in length between the two strings so that you get a low distance if one string is an approximate substring of the other -- be careful though, as if you do this the empty string will match everything.
In general you will always have a tradeoff between specificity and sensitivity, so the trick is simply to tune the heuristic in such a way that it performs in a way that you are comfortable with.