I have to group some hotel into the same category based on their names. I'm using levenshtein for grouping, but how much I've tried, some hotel are leaved outside the category they supposed to be, or in another category.
For example: all these hotel should be in the same category:
=============================
Best Western Bercy Rive Gauche
Best Western Colisee
Best Western Ducs De Bourgogne
Best Western Folkestone Opera
Best Western France Europe
Best Western Hotel Sydney Opera
Best Western Paris Louvre Opera
Best Western Hotel De Neuville
=============================
I'm having a list with all hotel names( like 1000 rows ). I also have how they should be grouped. Any idea how to optimize levenshtein, making it more flexible for my situation?
$inserted = false;
foreach($hotelList as $key => $value){
if (levenshtein($key, $hotelName, 2, 5, 1) <= abs(strlen($key) - strlen($hotelName))){
array_push($hotelList[$key], trim($line));
$inserted = true;
}
}
// if no match was found add another entry
if (!$inserted){
$hotelList[$hotelName] = array(
trim($line)
);
}
I'll wade in with my thoughts. Firstly, grouping or "clustering" data like this is a pretty big topic, I won't really go into it particularly but perhaps point things in an ideal direction.
You did a brilliant thing by normalizing Levenshtein on the length of the strings compared- that's exactly right because you avoid the problem that the length of the string would overdetermine the similarity in many cases.
But the algorithm didn't solve the problem. For a start, we want to compare words. "Bent Eastern French Hotels" is obviously very different to "Best Western French Hotels", yet it would score better than "Best Western Paris Bed and Breakfasts", say. The intution to grasp here is that your tokens shouldn't be characters but words.
I like @saury's answer, but I'm not sure about the assumption at the beginning. Instead, let's start with something nice and easy often called "bag of words". We then implement a hashing trick, which would allow you to idetify the key phrases based on the intuition that the least used words contain the most information.
If you subscribe to the idea that hotel brand names are near the beginning you could always skew on their proximity to the start of the string too. Thing is, your groups will as likely end up being "France" as "Best" / "Western" (but not "hotel"- why?).
You want your results to be more accurate?
From here on in, we're gonna have to take a step up to some serious algorithms- enjoy surfing the many stack overflow topics. My instinct is that I bet many hotel names aren't branded at all, so you'll need different categories for them too. And my instinct is also that the number of repeated words in hotel names is going to be relatively slim- some words will be frequent members of hotel names. These facts would be problems for the above. In this case, there's a really popular (if cliched for SO) technique called k-means, a fun introduction to which would be to extend an algorithm like this (very bravely written in php) to take your chosen n keyphrases as the n dimensions of the cluster, then take the majority components of the cluster center-points as your categorization tags. (That would eliminate "France", say, because hits for "France" would be spread across the n-dimensional space pretty evenly).
This is probably all a bit much to take on for something that would seem like a small problem- but I want to emphasize that if your data isn't structured, there really aren't any short-cuts to doing things properly.