Search code examples
phpmysqlsearchmisspelling

How to "redefine search" or correct "misspelling" from the database


I want to add new feature to the search in my website. I'm using PHP and MySQL. MySQL database containing a table to the items that the user will search for, for each item there is a "keyword" column that's comma separated keywords "EXAMPLE: cat,dog,horse". After the user search in my website I want to get the words that are let me say "85%" similar to his search keyword, this is for redefine search. And for misspelling I want a service or something that provide if the keyword is correct or misspelled so I get some corrections and check if those exists in the database and then give those corrections to user to change his search keyword.

I'm not asking for a solution here ... but if you can direct me in a one way or another that will be great.


Solution

  • The key is in your idea of "85% similar". Here are some ideas:

    Similar Words Table

    You can define a table where you list common misspellings for your keywords. You'll then have to augment how you search the database to map common misspellings to the proper value.

    Similar Words Lookup

    When you perform the search, use a library to generate similar words and search for all of them. You can use any sort of spelling library to generate possible word matches before sending the search. Or write your own based on the Edit Distance algorithm.

    Only check if needed: Since you're using PHP, you may consider pspell. You can first call pspell_check to see if the word is spelled correctly. Then call pspell_suggest to get suggestions.

    See this link for an example.

    Use a Database Feature

    MySQL, for example has a SOUNDS_LIKE operator. You can search for WHERE keyword SOUNDS_LIKE 'kat' and (presumably) get cat. More info is on the documentation page, which alerts you to some limitations (like English and UTF-8 only).


    It sounds like a fairly common problem, so perhaps there are other more canonical solutions to this problem. Perhaps there's something specific to the language you're using (or in the database interface layer) that can abstract this for you.

    The first two should allow you to meet some notion of 85% similarity. I have no idea how well the third option will work, but it "soundz kool."