python numpy machine-learning scipy hmmlearn

Missing data in hmmlearn from scikit-learn

I'm running a simple HMM using scikit-learn's hmmlearn module. It works for fully observed data, but it fails when I pass it observations with missing values. Small example:

import numpy as np
from hmmlearn import hmm

transmat = np.array([[0.9, 0.1],
                     [0.1, 0.9]])
emitmat = np.array([[0.5, 0.5],
                    [0.9, 0.1]])

# this does not work: cannot have missing data
obs = np.array([0, 1] * 5 + [np.nan] * 5)

# this works
#obs = np.array([0, 1] * 5 + [1] * 5)

startprob = np.array([0.5, 0.5])
h = hmm.MultinomialHMM(n_components=2,
                       startprob=startprob,
                       transmat=transmat)
h.emissionprob_ = emitmat
print(obs, type(obs))
posteriors = h.predict_proba(obs)
print(posteriors)

If obs is fully observed (every element is 0 or 1), it works, but I would like to get estimates for the unobserved data points. I tried encoding these as np.nan or None, but neither works. It raises IndexError: arrays used as indices must be of integer (or boolean) type (from hmm.py, line 430, in _compute_log_likelihood).

How can this be done in hmmlearn?

Solution

Currently, hmmlearn has no built-in way to impute missing data.

As an ad hoc approach, you can partition the observation sequence into fully observed subsequences. Then, for the gap that follows each subsequence, either pick the most likely next state and observation, or simulate them randomly from the transition and emission probabilities. Note that this strategy can lead to inconsistencies at the subsequence boundaries, since the values filled into a gap depend only on the subsequence before it and ignore the observations that come after.
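
A minimal sketch of the "most likely next observation" variant of this workaround, written in plain NumPy so that it does not depend on any particular hmmlearn version. The helper names (observed_runs, filtered_last_state, fill_gaps) are hypothetical and not part of hmmlearn. The sketch assumes missing values are encoded as np.nan and that observed symbols are integer codes indexing the columns of emitmat. A gap at the very start of the sequence has no preceding subsequence and is left as NaN.

```python
import numpy as np

def observed_runs(obs):
    """Return (start, stop) index pairs of contiguous non-NaN stretches."""
    observed = ~np.isnan(obs)
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], observed, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2], edges[1::2]))

def filtered_last_state(run, startprob, transmat, emitmat):
    """Forward algorithm: P(state at the end of run | symbols in run)."""
    alpha = startprob * emitmat[:, run[0]]
    alpha = alpha / alpha.sum()
    for symbol in run[1:]:
        alpha = (alpha @ transmat) * emitmat[:, symbol]
        alpha = alpha / alpha.sum()
    return alpha

def fill_gaps(obs, startprob, transmat, emitmat):
    """Fill each gap with the most likely symbol, given the run before it."""
    filled = obs.copy()
    runs = observed_runs(obs)
    for i, (start, stop) in enumerate(runs):
        # The gap ends where the next observed run starts (or at the end).
        gap_end = runs[i + 1][0] if i + 1 < len(runs) else len(obs)
        # Each run is treated as its own sequence starting from startprob,
        # which is the source of the boundary inconsistencies noted above.
        state_dist = filtered_last_state(
            obs[start:stop].astype(int), startprob, transmat, emitmat)
        for t in range(stop, gap_end):
            state_dist = state_dist @ transmat    # predict the next state
            symbol_dist = state_dist @ emitmat    # predicted symbol distribution
            filled[t] = np.argmax(symbol_dist)    # most likely symbol
    return filled

# Model and data from the question.
startprob = np.array([0.5, 0.5])
transmat = np.array([[0.9, 0.1],
                     [0.1, 0.9]])
emitmat = np.array([[0.5, 0.5],
                    [0.9, 0.1]])
obs = np.array([0, 1] * 5 + [np.nan] * 5)

filled = fill_gaps(obs, startprob, transmat, emitmat)
print(filled)
```

With the emission matrix from the question, every gap position is filled with symbol 0, because symbol 0 is at least as likely as symbol 1 under both states. The filled array contains no NaN, so after casting it to int it could be passed to predict_proba. To simulate instead of taking the most likely symbol, the np.argmax call could be replaced by a draw from symbol_dist, for example with np.random.choice.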