I have a website where I index information about movies. To find a movie the user can input the title of the movie and this query will be sent to the database:
SELECT IMDB, Name, Year, Views
FROM Movies
WHERE Name LIKE '%$search%'
I used the "similar_text" function to solve little mistakes. For example if the name is "Pulp Fiction" and the user types: "Pulp Foction", since I don't get any results I run this little code for every movie.
similar_text($search, $Name, $percent);
if ($percent > $ValMax) {
$ValMax = $percent;
$PosMax = $i;
}
I'm using this code since I only have a couple hundred movies in the database and in the future it's going to be at max about twenty thousand. Also every movie as a title of an average of 20 characters, without considering movies like Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb
The real problem starts with a query used by an user: "Capitan America" (italian for Captain America) where the name of the movie was "Captain America – Il primo vendicatore"
Using the Levenshtein function I got horrible results where with the similar_text they were at least decent. You can see the full results of that query here. I made up this page to see for each search what the levenshtein and the similar_text would be.
The movie that would have been selected if I didn't place a 60% blockage would have been "C'era una volta in America" (Once upon a time in America).
However "Capitan America" and "Captain America" are pretty similar so I was wondering if there was a way to maybe check for each word. I've also heard about Lucene but I didn't know how to start from and If it was suited for my needs.
Thanks :)
You could use SOUNDS LIKE
(available as a MySQL string function):
SELECT IMDB, Name, Year, Views FROM Movies WHERE Name SOUNDS LIKE '%$search%'
You might want to benchmark the performance of this on large tables as the preceding wildcard in '%$search%'
means that your statement will not be able to avail of any indexes.
Another possible solution involves creating a custom levenshtein function for MySQL. Here, you can find an example of one:
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;