
Hmmlearn classification usage


I'm trying to learn a model with hmmlearn in order to perform classification on my dataset. The dataset is a list of sequences of different lengths, where each sequence consists of emitted events. For example:

ID1: ['1', '10', '8', '15']
ID2: ['1', '10', '8', '15', '156', '459', '256']

This is the code I'm using. I found a similar example here.

    import numpy as np
    from hmmlearn import hmm

    sequence_map = __load_df(file)

    x = []
    lengths = []

    for values in sequence_map.values():
        x.append(values)
        lengths.append(len(values))

    # hmmlearn expects a 2-D array of shape (n_samples, n_features)
    x = np.concatenate(x).reshape(-1, 1).astype(float)

    model = hmm.GaussianHMM(n_components=2, algorithm='map', n_iter=1000,
                            covariance_type="full").fit(x, lengths=lengths)
    predictions = model.predict(x, lengths=lengths)

I'm interested in classifying the events into two categories, so I chose n_components=2.
How can I now retrieve the class of each sequence in my dataset?


Solution

  • The function predict returns the most likely sequence of hidden states given the input. This is not what you want for your problem, which is about classifying sequences into one of two classes.

    What you need might be the method predict_proba (see the documentation here), which gives you the posterior probability of each state for every observation.
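
    For example, a minimal sketch of such a call, reusing model, x, and lengths from the question's code (assuming the model was fitted as shown there):

        # Posterior probability of each hidden state for every observation
        state_probs = model.predict_proba(x, lengths=lengths)
        print(state_probs.shape)  # (n_samples, n_components)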

    However, keep in mind that you cannot know for sure how the HMM learned to distinguish between the two classes. Unless you estimated/initialized the first Gaussian from training samples belonging to class 1 and the second Gaussian from samples belonging to class 2, you cannot know whether the HMM associated each of its states with a class. The two classes could also have been learned such that it is the sequence of transitions between the states that distinguishes them: class 1 could give a state pattern of S1-S2-S1-S2-S1-S2... while class 2 gives a pattern of S1-S1-S2-S2-S1-S1-... Remember that HMMs are good at modelling time series. Ask yourself: if a single Gaussian distribution could fully represent one class, why use hidden Markov models at all?

    For (binary) classification, a more reliable approach is to train two HMMs: one on the samples of the first class, and the other on the samples of the second class. Once they are trained, each test sequence is scored against both models with the method score (see the documentation here), which returns the log-likelihood of the input sequence under the model you call it on. The test sequence is then assigned to the class whose model returns the highest log-likelihood, as sketched below.

    What is interesting with such an approach is that if both models return a low likelihood, you can also tag a test sequence as belonging to neither of the two classes of interest. The scheme also generalizes well to a few dozen classes.
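
    A minimal sketch of this two-model scheme, assuming the training sequences have already been split by class into train_seqs_1 and train_seqs_2 (lists of 1-D numeric sequences; these names are placeholders, not from the question):

        import numpy as np
        from hmmlearn import hmm

        def stack(seqs):
            # hmmlearn wants one 2-D array plus the per-sequence lengths
            lengths = [len(s) for s in seqs]
            X = np.concatenate(seqs).reshape(-1, 1).astype(float)
            return X, lengths

        X1, len1 = stack(train_seqs_1)
        X2, len2 = stack(train_seqs_2)

        # One HMM per class, each trained only on that class's sequences
        model_1 = hmm.GaussianHMM(n_components=2, n_iter=1000,
                                  covariance_type="full").fit(X1, lengths=len1)
        model_2 = hmm.GaussianHMM(n_components=2, n_iter=1000,
                                  covariance_type="full").fit(X2, lengths=len2)

        def classify(seq, reject_threshold=None):
            x = np.asarray(seq, dtype=float).reshape(-1, 1)
            ll1 = model_1.score(x)  # log-likelihood under the class-1 model
            ll2 = model_2.score(x)  # log-likelihood under the class-2 model
            if reject_threshold is not None and max(ll1, ll2) < reject_threshold:
                return None         # neither class fits well enough
            return 1 if ll1 > ll2 else 2

    Note that score returns an unnormalized log-likelihood whose magnitude grows with the sequence length, so comparing the two models on the same sequence is safe, but a fixed rejection threshold may need to be normalized by len(seq).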