java · nlp · classification · stanford-nlp

How to train a Chinese segmenter model by Stanford NLP Tools


I am new to the Stanford CoreNLP tools. Right now I am not getting good segmentation results for Chinese, so I want to change the granularity of the segmenter. I thought I could do this by training my own dictionary.

I downloaded the trainSegmenter-20080521 package and followed trainSegmenter-20080521/README.txt.

This is the README.txt:

Sat Jun 21 00:57:22 2008
Author: Pi-Chuan Chang

Here is documentation of how to train and test the segmenter on a specific split
range of the CTB data.

The following steps assume you have 3 files defining the ranges of train/dev/test.
They should be named "ctb6.train", "ctb6.dev", and "ctb6.test" respectively.
The format should be like:
      chtb_0003.fid
      chtb_0015.fid
      ...

[STEP 1] change the CTB6 path in the Makefile:
      CTB6=/afs/ir/data/linguistic-data/Chinese-Treebank/6/

[STEP 2] download and uncompress the latest segmenter from:
      http://nlp.stanford.edu/software/stanford-chinese-segmenter-2008-05-21.tar.gz
and change this path in the Makefile to your local path:
      SEGMENTER=/tmp/stanford-chinese-segmenter-2008-05-21/

[STEP 3] simply type:
      make all
You can also break this down into these sub-steps:
      make internaldict # make internal dictionaries for affixation features
      make data         # make datasets
      make traintest    # train & test the CRF segmenter

But I still have some problems:

  1. What is the format of the training files, and what are the train/dev/test splits each used for?

  2. What are chtb_0003.fid, chtb_0015.fid, and so on?

  3. What is the CTB6 path in the Makefile? It seems that I should change the variable CTB6 to /afs/ir/data/linguistic-data/Chinese-Treebank/6/, but that value is already there, and it does not look like a valid path on my machine.

By the way, there are many properties that have to be set for particular needs, e.g., sighanPostProcessing and serDictionary.

Is there somewhere I can find all of these properties and their explanations?

All I can do now is read the source code, e.g., edu.stanford.nlp.sequences.SeqClassifierFlags.java, but I am still confused by these property flags.
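
For reference, this is roughly how I run the pretrained model at the moment, adapted from the segment.sh script in the segmenter download (the jar, model, and dictionary paths below are just my local copies, so treat them as placeholders):

      java -mx2g -cp seg.jar edu.stanford.nlp.ie.crf.CRFClassifier \
           -loadClassifier data/ctb.gz \
           -serDictionary data/dict-chris6.ser.gz \
           -sighanCorporaDict data \
           -testFile my_input.txt \
           -inputEncoding UTF-8 \
           -sighanPostProcessing true \
           -keepAllWhitespaces false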

I would appreciate anyone's help.


Solution

  • I would ignore that README. The information in it is fairly out of date.

    A more recent explanation is here:

    http://nlp.stanford.edu/software/segmenter-faq.shtml

    The expected input format is one sentence per line, with the text on each line already segmented into words separated by whitespace. If you get your segmented data from parse trees, there are tools which will convert parse trees to segmented text.
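
    If your data is in bracketed (Penn Treebank / CTB style) parse-tree format, a rough sketch of that conversion might look like the code below. This is not a tool from the distribution, just a minimal example written against the edu.stanford.nlp tree APIs in a recent CoreNLP release (PennTreeReader, Tree.taggedYield()); it assumes plain bracketed trees with any SGML wrappers already stripped, and it drops empty elements (-NONE-) so traces don't end up in the training text:

      import edu.stanford.nlp.ling.TaggedWord;
      import edu.stanford.nlp.trees.PennTreeReader;
      import edu.stanford.nlp.trees.Tree;

      import java.io.*;

      public class TreesToSegmentedText {
        public static void main(String[] args) throws IOException {
          // args[0]: a file of bracketed parse trees (UTF-8, markup already removed)
          try (Reader in = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");
               PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "UTF-8"), true)) {
            PennTreeReader reader = new PennTreeReader(in);
            for (Tree tree = reader.readTree(); tree != null; tree = reader.readTree()) {
              StringBuilder line = new StringBuilder();
              for (TaggedWord tw : tree.taggedYield()) {
                if ("-NONE-".equals(tw.tag())) continue;  // skip empty elements / traces
                if (line.length() > 0) line.append(' ');
                line.append(tw.word());
              }
              out.println(line);  // one pre-segmented sentence per line
            }
          }
        }
      }

    Each output line is then one whitespace-separated, pre-segmented sentence, which matches the training format described above.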

    If there are particular sentences which aren't segmented correctly, it may be because the segmenter uses the CTB segmentation standard and you would prefer something different. It may also be because of words that the segmenter doesn't know about. If you send example sentences which follow the CTB segmentation standard to java-nlp-user, those unknown words will eventually make their way into the segmenter's training data.