Tags: artificial-intelligence, spell-checking, spelling

How to do Norvig spell check for Chinese characters mixed with English letters?


I have a list of product names, written in a mixture of English letters, numbers, and Chinese characters, stored in my database.

There is a table called products with the fields name_en and name_zh, amongst others.

E.g.

AB 10"机翼

Peter Norvig has a fantastic algorithm for spell check but it only works for English.

I was wondering if there's a way to do something similar for a narrow list of terms containing Chinese characters?

E.g. misspellings such as

A10机翼
AB 10鸡翼
AB 10鸡一
AB 10木几翼

should all prompt AB 10"机翼 as the correct spelling.

How do I do this?


Solution

  • You have a much more complex problem than Norvig's:

    Chinese Input-method

    The misspellings in your case (at least in your example) are mostly caused by the pinyin input method: the same typed pinyin "jiyi" (English: airplane wing) can produce several different Chinese phrases:

     机翼
     鸡翼
     鸡一
     几翼
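
    One way to generate those homophone candidates programmatically is to key tokens by their toneless pinyin. Below is a minimal sketch using the pypinyin package; the token list is a made-up stand-in for the vocabulary of your product names:

        from collections import defaultdict
        from pypinyin import lazy_pinyin

        # Hypothetical token vocabulary extracted from your product names.
        tokens = ['机翼', '鸡翼', '鸡一', '几翼', '模型', '零件']

        # Group tokens by toneless pinyin so homophones share a bucket.
        by_pinyin = defaultdict(list)
        for token in tokens:
            key = ''.join(lazy_pinyin(token))   # e.g. '机翼' -> 'jiyi'
            by_pinyin[key].append(token)

        print(by_pinyin['jiyi'])   # ['机翼', '鸡翼', '鸡一', '几翼']

    Tokens that share a bucket are exactly the input-method confusions you will need as replacement candidates later.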
    

    Chinese Segmentation

    Also, in Chinese you need segmentation to break a long sentence up into small tokens with semantic meaning. For example:

    飞机模型零件    -> before segmentation
    飞机-模型-零件  -> after segmentation: three phrases (airplane-model-parts) separated by '-'
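
    If you prototype in Python, the jieba package is a common segmenter. A minimal sketch follows; the add_word call is an assumption, only needed for domain terms missing from jieba's default dictionary:

        import jieba

        # Teach the segmenter domain-specific product terms it may not know.
        jieba.add_word('机翼')

        print(jieba.lcut('飞机模型零件'))   # -> ['飞机', '模型', '零件']
        print(jieba.lcut('AB 10机翼'))      # -> roughly ['AB', ' ', '10', '机翼']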
    

    Work at the token level

    You can experiment starting from a list of misspellings; I guess you can collect a bunch of them from your user logs. Take one misspelling at a time, using your example:

    AB 10鸡翼
    

    First break it into tokens:

    AB-10-鸡翼
    

    (here you probably need a Chinese segmentation algorithm to realize that 鸡翼 should be treated as a single token).

    Then you should try to find its nearest neighbor in your product db using the edit distance idea. Note two things (a sketch follows this list):

    • you do not remove/edit/replace one character at a time, but remove/edit/replace one token at a time.
    • when editing/replacing, limit the candidates to near neighbors of the original token. For example, 鸡翼 -> 机翼, 几翼, 机一
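
    Here is a minimal sketch of that token-level edit distance. The tokenize function is a regex stand-in for a real segmenter, and NEAR is an assumed homophone table that in practice you would derive from pinyin buckets like the ones above:

        import re

        def tokenize(s):
            # Stand-in tokenizer: letter runs, digit runs, CJK runs.
            # A real system would use a proper Chinese segmenter here.
            return re.findall(r'[A-Za-z]+|\d+|[\u4e00-\u9fff]+', s)

        # Assumed homophone table, e.g. derived from pinyin buckets.
        NEAR = {'鸡翼': {'机翼', '几翼', '鸡一'}}

        def token_edit_distance(a, b):
            # Levenshtein DP over tokens instead of characters.
            m, n = len(a), len(b)
            d = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                d[i][0] = i
            for j in range(n + 1):
                d[0][j] = j
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    if a[i-1] == b[j-1]:
                        sub = d[i-1][j-1]
                    elif b[j-1] in NEAR.get(a[i-1], ()):
                        sub = d[i-1][j-1] + 1   # cheap swap between homophones
                    else:
                        sub = d[i-1][j-1] + 2   # unrelated tokens cost more
                    d[i][j] = min(d[i-1][j] + 1,   # delete a token
                                  d[i][j-1] + 1,   # insert a token
                                  sub)
            return d[m][n]

        def correct(query, product_names):
            q = tokenize(query)
            return min(product_names,
                       key=lambda p: token_edit_distance(q, tokenize(p)))

        print(correct('AB 10鸡翼', ['AB 10"机翼', 'CD 20机身']))   # AB 10"机翼

    In this sketch "limiting the candidates" is expressed as a cost: swapping a token for one of its homophones is cheap, anything else is expensive.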

    Build Lucene index

    You can also tackle the problem from the other direction, starting from your correct product names. Treat each product name as a document and pre-build a Lucene index from them. Then each user query becomes a search problem: issue the query to the search engine and let it find the best-matching documents in your db. In this case, I believe Lucene would take care of segmentation and tokenization for you (if not, you would need to extend its functionality to suit your own needs).
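
    If you want to prototype the Lucene idea before pulling in the real thing, its core fits in a few lines of Python: segment every product name, build an inverted index, and rank products by how many query tokens they contain. This is only a stand-in for Lucene, with none of its analyzers or scoring:

        from collections import Counter, defaultdict
        import re

        def tokenize(s):
            # Same stand-in tokenizer as above; in Lucene a Chinese
            # analyzer such as SmartChineseAnalyzer would do this job.
            return re.findall(r'[A-Za-z]+|\d+|[\u4e00-\u9fff]+', s)

        products = ['AB 10"机翼', 'CD 20机身', 'AB 15机翼']

        # Inverted index: token -> ids of the products containing it.
        index = defaultdict(set)
        for pid, name in enumerate(products):
            for token in tokenize(name):
                index[token].add(pid)

        def search(query):
            # Rank products by the number of query tokens they contain.
            scores = Counter()
            for token in tokenize(query):
                for pid in index[token]:
                    scores[pid] += 1
            return [products[pid] for pid, _ in scores.most_common()]

        print(search('AB 10鸡翼'))   # 'AB 10"机翼' ranks first

    To recover the homophone match as well, expand each query token with its pinyin bucket before looking it up in the index.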