machine-learning, hidden-markov-models

About training an HMM using EM


I am new to the EM algorithm and am studying Hidden Markov Models.

While training my HMM with EM (for text processing), I am confused about how to set up the data.

Please confirm whether my use of EM is correct.

At first, I computed the statistics for the emission probability matrix from my whole training set, and then ran EM on that same set. The emission probabilities for data unseen in training converged to zero.
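For illustration, here is a minimal numpy sketch of that counting step (the tag set, vocabulary, and `tagged_corpus` are made-up placeholders). It shows why a pure count-based estimate pins unseen (tag, word) pairs at exactly zero, where EM then leaves them:

```python
import numpy as np

# Made-up toy tag set, vocabulary, and labeled corpus, just to
# illustrate the counting step.
tags = ["DET", "NOUN", "VERB"]
vocab = ["the", "dog", "barks", "cat"]
tagged_corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]

tag_idx = {t: i for i, t in enumerate(tags)}
word_idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(tags), len(vocab)))
for sentence in tagged_corpus:
    for word, tag in sentence:
        counts[tag_idx[tag], word_idx[word]] += 1

# Plain maximum-likelihood estimate: any (tag, word) pair never seen in
# training gets probability exactly 0, and EM can never move a zero
# emission parameter off zero -- the collapse described above.
emission_mle = counts / counts.sum(axis=1, keepdims=True)

# Add-one (Laplace) smoothing keeps every entry positive, so EM can
# still shift probability mass toward previously unseen pairs.
emission_smoothed = (counts + 1) / (counts + 1).sum(axis=1, keepdims=True)
```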

While reading the textbook Speech and Language Processing, I found that exercise 8.3 describes a two-phase training method:

8.3 Extend the HMM tagger you built in Exercise 8.?? by adding the ability to make use of some unlabeled data in addition to your labeled training corpus. First acquire a large unlabeled corpus. Next, implement the forward-backward training algorithm. Now start with the HMM parameters you trained on the training corpus in Exercise 8.??; call this model M0. Run the forward-backward algorithm with these HMM parameters to label the unsupervised corpus. Now you have a new model M1. Test the performance of M1 on some held-out labeled data.

Following this, I selected some instances from my training set (1/3 of it) to get the initial statistics, and then ran the EM procedure on the whole training set to optimize the parameters.
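For reference, here is a minimal unscaled numpy sketch of the forward-backward (Baum-Welch) re-estimation this procedure relies on; `pi`, `A`, and `B` are the initial-state, transition, and emission parameters, `obs` is one integer-encoded sentence, and a real implementation would need log-space or per-step scaling to avoid underflow on long sequences:

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """E-step for one integer-encoded sequence `obs`: returns the state
    posteriors gamma[t, i] and transition posteriors xi[t, i, j]."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))   # forward probabilities
    beta = np.zeros((T, N))    # backward probabilities
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood
    # xi[t, i, j] = alpha[t, i] * A[i, j] * B[j, obs[t+1]] * beta[t+1, j] / P(obs)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    return gamma, xi

def baum_welch_step(obs, pi, A, B):
    """One EM iteration (M-step re-estimation) for a single sequence."""
    gamma, xi = forward_backward(obs, pi, A, B)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for t, o in enumerate(obs):
        new_B[:, o] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B

# Usage sketch: initialize (pi, A, B) from the counts on the labeled
# subset, then call baum_welch_step repeatedly on the full data until
# the likelihood stops improving.
```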

Is it ok?


Solution

  • The procedure that the exercise is referring to is a type of unsupervised learning known as self-training. The idea is that you use your entire labeled training set to build a model. Then you collect more data that is unlabeled; it is much easier to find new unlabeled data than new labeled data. After that, you label the new data using the model you originally trained. Finally, using the automatically generated labels, you train a new model, as sketched below.
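As a rough illustration of that loop, here is a minimal sketch; the helper callables (`train_supervised`, `tag_with`, `accuracy`) are hypothetical stand-ins for your own HMM training, tagging, and evaluation code, not a specific library API:

```python
# A minimal sketch of the self-training loop described above.

def self_train(labeled_corpus, unlabeled_corpus, held_out,
               train_supervised, tag_with, accuracy):
    # 1. Build the initial model M0 from ALL of the labeled data.
    m0 = train_supervised(labeled_corpus)

    # 2. Label the (cheap to collect) unlabeled corpus with M0.
    auto_labeled = [tag_with(m0, sentence) for sentence in unlabeled_corpus]

    # 3. Train a new model M1 on the original labels plus the
    #    automatically generated ones.
    m1 = train_supervised(labeled_corpus + auto_labeled)

    # 4. Keep M1 only if it actually beats M0 on held-out labeled data.
    return m1 if accuracy(m1, held_out) > accuracy(m0, held_out) else m0
```

The key design point is step 1: the initial model is trained on all of the labeled data, not a 1/3 subset of it as in your setup; the data added in the second phase should be new unlabeled data, not the remainder of the labeled training set.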