algorithm, open-source, cjk, text-segmentation

Is there a good open-source or freely available Chinese segmentation algorithm?


As the title says, I'm looking for a free and/or open-source text-segmentation algorithm for Chinese. I understand it is a very difficult problem, as there are many ambiguities involved. I know there's Google's API, but it is rather a black box, i.e. it exposes little information about what it is actually doing.


Solution

  • The Chinese keyword for text segmentation is 中文分词 (Chinese word segmentation).

    Good and active open-source text-segmentation projects:

    1. 盘古分词(Pan Gu Segment) : C#, Snapshot
    2. ik-analyzer : Java
    3. ICTCLAS : C/C++, Java, C#, Demo
    4. NlpBamboo : C, PHP, PostgreSQL
    5. HTTPCWS : based on ICTCLAS, Demo
    6. mmseg4j : Java
    7. fudannlp : Java, Demo
    8. smallseg : Python, Java, Demo
    9. nseg : Node.js
    10. mini-segmenter : Python

    Other

    1. Google Code : http://code.google.com/query/#q=中文分词
    2. OSChina (Open Source China)

    Sample

    1. Google Chrome (Chromium) : src, cc_cedict.txt (73,145 Chinese words/phrases)

      • In a text field or textarea in Google Chrome containing Chinese sentences, press Ctrl+← or Ctrl+→

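      • Double-click on 中文分词指的是将一个汉字序列切分成一个一个单独的词 ("Chinese word segmentation means splitting a sequence of Chinese characters into individual words")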
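
    To get a feel for what dictionary-based segmentation involves, here is a minimal Python sketch of greedy forward maximum matching, run on the sample sentence above. It is only a common baseline and is not claimed to be what Chrome or any of the projects listed here actually do. The function name `segment`, the `max_len` parameter and the tiny `TOY_DICT` word set are made up for illustration; a real dictionary such as cc_cedict.txt would be far larger.

    ```python
    def segment(text, dictionary, max_len=4):
        """Greedy forward maximum matching over a set of known words."""
        words = []
        i = 0
        while i < len(text):
            # Try the longest candidate first and shrink until it is in the
            # dictionary; a single character is always accepted as a fallback.
            for size in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + size]
                if size == 1 or candidate in dictionary:
                    words.append(candidate)
                    i += size
                    break
        return words


    # Hypothetical toy dictionary, just large enough for the sample sentence.
    TOY_DICT = {"中文", "分词", "指的是", "一个", "汉字", "序列", "切分", "单独"}

    sentence = "中文分词指的是将一个汉字序列切分成一个一个单独的词"
    result = segment(sentence, TOY_DICT)
    print("/".join(result))
    # 中文/分词/指的是/将/一个/汉字/序列/切分/成/一个/一个/单独/的/词
    ```

    Greedy matching cannot resolve the ambiguities mentioned in the question (it always commits to the longest word starting at the current position), which is why the projects above are worth a look for real use.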