hidden-markov-models

How to create parameters for HMM model of online handwriting recognition?


I'm a little inexperienced with Hidden Markov Models. If I want to build an HMM for online handwriting recognition (that is, recognizing letters as the user draws them live on a device, rather than recognizing an image of a letter), what are the model parameters? Specifically, what are the:

  • hidden states,
  • observations,
  • initial state probabilities,
  • state transition probabilities,
  • emission probabilities?

What I have right now is probably the observations: an array of { x, y, timestamp } points, one for each dot I record from the user's finger movement on the tablet.

The system will only record/train/recognize one digit at a time. Does that mean I have 10 hidden states (0 to 9)? Or 10 classification results? On various websites like this one, I found that hidden states usually come in the form of a "sequence", rather than a single state like that. What are the states in this case?


Solution

  • HMMs work well with temporal data, but they may be a suboptimal fit for this problem.

    As you've identified, the observations {x, y, timestamp} are temporal in nature. As a result, they are best cast as the emissions of the HMM, while the digits are reserved as the hidden states.

    • Explicitly, if the digits (0 to 9) are encoded as hidden states, then for a 100 x 100 "image" the emission can be any of 10,000 possible pixel coordinates.
    • The model predicts a digit state at every timestamp (on-line). The output is a non-unique pixel location. This is cumbersome but not impossible to encode (you'd just have a huge emission matrix).
    • The initial state probabilities (which digit the sequence starts in) can be a uniform distribution (1/10 each). More cleverly, you can invoke Benford's law to approximate the frequency with which digits appear in text and distribute your starting probabilities accordingly.
    • State transition and emission probabilities are tricky. One strategy is to train your HMM using the Baum-Welch algorithm (a variant of Expectation-Maximization) to iteratively and agnostically estimate the parameters of your transition and emission matrices. The training data would be known digits with pixel locations registered across time.
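As a sketch of how these pieces fit together, here is a minimal, hypothetical NumPy setup: quantizing raw { x, y } dots on the 100 x 100 grid into discrete emission symbols, and building a Benford-style initial state distribution. Note that Benford's law only covers leading digits 1-9, so giving digit 0 mass via a uniform smoothing term is my assumption, not part of the standard law; the grid size and smoothing weight are also made-up choices.

```python
import numpy as np

GRID = 100  # assumed 100 x 100 "image", giving 100 * 100 = 10000 emission symbols

def quantize(points, grid=GRID):
    """Map raw {x, y, timestamp} dots (x, y assumed in [0, 1)) to symbol indices."""
    symbols = []
    for p in points:
        col = min(int(p["x"] * grid), grid - 1)
        row = min(int(p["y"] * grid), grid - 1)
        symbols.append(row * grid + col)  # flatten (row, col) to one index
    return np.array(symbols)

def benford_initial_probs(smoothing=0.05):
    """Initial state distribution over digits 0-9.

    Benford's law gives P(d) = log10(1 + 1/d) for leading digits 1-9;
    digit 0 receives mass only through the uniform smoothing term
    (an assumption, since Benford's law does not cover 0).
    """
    probs = np.zeros(10)
    probs[1:] = np.log10(1.0 + 1.0 / np.arange(1, 10))
    probs = (1 - smoothing) * probs + smoothing / 10.0
    return probs / probs.sum()

stroke = [{"x": 0.12, "y": 0.50, "t": 0},
          {"x": 0.13, "y": 0.52, "t": 16},
          {"x": 0.15, "y": 0.55, "t": 32}]
print(quantize(stroke))          # three symbol indices in [0, 9999]
print(benford_initial_probs())   # sums to 1, heaviest on digit 1
```

The resulting symbol sequences (one per recorded stroke of a known digit) are what you would feed to a Baum-Welch trainer.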

    Casting the problem the other way is less natural because of this lack of temporal fit, but not impossible.

    • You can also define 10,000 states aligned to the pixels, with 10 possible emissions (0-9).
    • However, the most commonly used HMM algorithms have run times quadratic in the number of states (e.g., the Viterbi algorithm for the most likely hidden state sequence runs in O(T * n_states^2), where T is the sequence length). You are therefore incentivized to keep the number of hidden states low.
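To make that quadratic cost concrete, here is a minimal NumPy Viterbi sketch on a toy 2-state HMM (the matrices are illustrative, not trained values); the inner loop over current states, each scanning all previous states, is the n_states^2 factor repeated T times:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state path; runs in O(T * n_states^2)."""
    n_states = len(pi)
    T = len(obs)
    delta = np.zeros((T, n_states))           # best path probability ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers for path recovery
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):                # T time steps...
        for j in range(n_states):        # ...each scanning n_states^2 transitions
            scores = delta[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            delta[t, j] = scores[back[t, j]] * B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):        # trace backpointers to recover the path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 2-symbol HMM (made-up numbers for illustration)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission matrix
print(viterbi(pi, A, B, [0, 1, 0]))      # -> [0, 1, 0]
```

With 10,000 pixel states the n_states^2 term is 10^8 per time step, which is why the digit-as-state formulation is the cheaper one.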

    Unsolicited suggestions

    What you might actually be looking for are Kalman filters, which may be a more elegant way to build this on-line digit recognizer from the time-series format (outside of CNNs, which appear to be the most effective).
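As a hedged illustration of that idea (every noise parameter below is a made-up assumption, not a tuned value), a constant-velocity Kalman filter can smooth the raw (x, y) pen samples before any classification step:

```python
import numpy as np

def kalman_smooth_stroke(points, dt=1.0, q=1e-3, r=1e-2):
    """Constant-velocity Kalman filter over (x, y) pen samples.

    State vector: [x, y, vx, vy]. Process noise q and measurement
    noise r are illustrative guesses, not tuned values.
    """
    F = np.eye(4)
    F[0, 2] = F[1, 3] = dt                       # position += velocity * dt
    H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1  # we observe position only
    Q = q * np.eye(4)
    R = r * np.eye(2)
    x = np.array([points[0][0], points[0][1], 0.0, 0.0])
    P = np.eye(4)
    out = []
    for z in points:
        # predict step: propagate state and uncertainty forward
        x = F @ x
        P = F @ P @ F.T + Q
        # update step: blend in the new (x, y) measurement
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.asarray(z) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        out.append(x[:2].copy())
    return np.array(out)

noisy = [(0.0, 0.0), (0.11, 0.09), (0.19, 0.21), (0.31, 0.30)]
print(kalman_smooth_stroke(noisy))  # 4 filtered (x, y) estimates
```

This only denoises the trajectory; you would still need a classifier on top of the filtered track to get a digit out.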

    You may also want to look at structured perceptrons if your emissions are multivariate (i.e., x and y) and independent. Here, though, I believe the x, y coordinates are correlated and should be modeled as such.