Tags: algorithm, nlp, hidden-markov-models, markov

Why use a hidden Markov model vs. a Markov model in the Baum-Welch algorithm


So I am trying to implement the Baum-Welch algorithm to do part-of-speech tagging for practice. However, I am confused about using a hidden Markov model vs. a plain Markov model, since it seems that you lose context moving from state to state: the output of the last state isn't taken into account when moving to the next state. Is it just to save memory?

edit: added an example for clarity

For example, if two states A and B each output a 1 or a 2, there are 4 state transitions and 2 observation possibilities for each state, which can be combined into 8 transitions if you mix each pair of incoming transitions with its state's observation probabilities. But my hang-up is: why not instead train a machine with the four states {(A,1), (B,1), (A,2), (B,2)} and 16 transitions? I am quite new to NLP, so I am wondering if I am unaware of some algorithmic redundancy that is hard to see without harder math.

It seems that one loses the information of what the transitions will be when the last A was a 1 vs. a 2, but I'm wondering whether the training algorithm might not need that information.
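To make the counting concrete, here is a rough sketch of the two parameterizations I am comparing (Python, with made-up probabilities; the state and symbol names are only for illustration):

```python
import numpy as np

# Hidden-Markov-model view: 2 hidden states x 2 symbols.
A = np.array([[0.7, 0.3],        # transition probabilities between hidden states
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # emission probabilities, one row per hidden state
              [0.2, 0.8]])
print("HMM parameters:", A.size + B.size)    # 4 transitions + 4 emissions = 8

# Plain-Markov-chain view: one state per (hidden state, symbol) pair,
# i.e. {(A,1), (B,1), (A,2), (B,2)}, with a full 4x4 transition matrix.
T = np.full((4, 4), 0.25)        # placeholder values; all 16 entries are free
print("Combined-chain parameters:", T.size)  # 16 transitions
```

My question is whether those 16 numbers are genuinely more expressive, or whether training would just recover something the 8 HMM numbers already capture.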

https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm

Thanks for the information.


Solution

  • It's not just to save memory; it's to provide a better model of what is really going on. In the case of text, you believe that there is an underlying grammar which says that this word is being used as a noun and that word is being used as a verb, but you don't get labels that say this, and it isn't always obvious from the data. E.g., looking at what I have just typed, "better" is an adjective in "a better model", but if I am using Stack Overflow to better myself, I have just used "better" as a verb. So whether "better" is an adjective or a verb is a hidden state, and a realistic model will reflect this.

    Hidden Markov models are also flexible enough that if you really don't have any hidden state, you can create a degenerate sort of hidden Markov model that reflects this. For instance, if every hidden state can produce only one possible output, and no two hidden states can ever produce the same output, then you have a hidden Markov model in which you can always predict the so-called hidden state from the output and vice versa. It will be very easy to fit the parameters of this, but it probably won't be as good at modelling reality as a proper hidden Markov model.
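    For a concrete (if toy) illustration of that degenerate case, here is a minimal sketch in Python with made-up transition numbers: the emission matrix is the identity, so each hidden state can only ever emit its own symbol, and the forward pass puts all of its probability mass on the observed symbol at every step.

    ```python
    import numpy as np

    # Degenerate HMM: two hidden states, two symbols, identity emissions,
    # so the "hidden" state is fully determined by the observation.
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])   # hidden-state transition probabilities (made up)
    B = np.eye(2)                # emissions: state i always emits symbol i
    pi = np.array([0.5, 0.5])    # initial state distribution

    def forward_posteriors(obs):
        """Filtered state distributions P(state_t | obs_1..t) from the forward pass."""
        alpha = pi * B[:, obs[0]]
        posteriors = [alpha / alpha.sum()]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            posteriors.append(alpha / alpha.sum())
        return posteriors

    # Each posterior is a point mass on the observed symbol, so the model
    # behaves like an ordinary Markov chain over the outputs.
    print(forward_posteriors([0, 1, 1, 0]))
    ```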