I'm running a simple HMM using scikit-learn's hmmlearn module. It works for fully observed data, but it fails when I pass it observations with missing data. Small example:
import numpy as np
import hmmlearn
import hmmlearn.hmm as hmm
transmat = np.array([[0.9, 0.1],
                     [0.1, 0.9]])
emitmat = np.array([[0.5, 0.5],
                    [0.9, 0.1]])
# this does not work: cannot have missing data
obs = np.array([0, 1] * 5 + [np.nan] * 5)
# this works
#obs = np.array([0, 1] * 5 + [1] * 5)
startprob = np.array([0.5, 0.5])
h = hmm.MultinomialHMM(n_components=2,
                       startprob=startprob,
                       transmat=transmat)
h.emissionprob_ = emitmat
print obs, type(obs)
posteriors = h.predict_proba(obs)
print posteriors
If obs is fully observed (every element is 0 or 1) it works, but I would like to get estimates for the unobserved data points. I tried encoding these as np.nan or None, but neither works. It gives the error IndexError: arrays used as indices must be of integer (or boolean) type (in hmm.py, line 430, in _compute_log_likelihood).
How can this be done in hmmlearn?
Currently there's no way of doing missing data imputation using hmmlearn.
As an ad hoc approach, you can partition the observation sequence into fully observed subsequences and then, for each subsequence, either pick the most likely next state and observation or simulate them randomly from the transition and emission probabilities. Note that this strategy can lead to inconsistencies at the subsequence boundaries.
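A rough sketch of the "pick the most likely" variant, written in plain NumPy so it does not depend on any particular hmmlearn version. The function name impute_forward is made up for illustration and is not part of hmmlearn. It assumes missing values are encoded as np.nan, runs a forward (filtering) pass over the observed symbols, and at each missing step pushes the current state distribution through the transition matrix and picks the most probable state and symbol. Observations after a gap are not used to fill that gap, which is where the boundary inconsistencies mentioned above come from.

```python
import numpy as np

# Model parameters from the question.
startprob = np.array([0.5, 0.5])
transmat = np.array([[0.9, 0.1],
                     [0.1, 0.9]])
emitmat = np.array([[0.5, 0.5],
                    [0.9, 0.1]])


def impute_forward(obs, startprob, transmat, emitmat):
    """Hypothetical helper: fill NaN observations left to right.

    Returns (filled, states): the observation sequence with NaNs replaced
    by the most likely symbol, and the most likely state at each step
    given only the observations seen so far.
    """
    obs = np.asarray(obs, dtype=float)
    filled = np.empty(len(obs), dtype=int)
    states = np.empty(len(obs), dtype=int)
    prior = startprob  # predicted state distribution for the current step
    for t, o in enumerate(obs):
        if np.isnan(o):
            # No evidence at this step: keep the predicted distribution
            # and pick the symbol with the highest predictive probability.
            belief = prior
            filled[t] = np.argmax(belief.dot(emitmat))
        else:
            # Observed step: weight the prediction by the emission
            # probability of the observed symbol and renormalize.
            filled[t] = int(o)
            belief = prior * emitmat[:, filled[t]]
            belief = belief / belief.sum()
        states[t] = np.argmax(belief)
        # One-step-ahead prediction for the next time step.
        prior = belief.dot(transmat)
    return filled, states


obs = np.array([0, 1] * 5 + [np.nan] * 5)
filled, states = impute_forward(obs, startprob, transmat, emitmat)
```

With the emission matrix from the question, both states emit symbol 0 with probability at least 0.5, so the most likely imputed symbol is always 0. To simulate instead, replace the two np.argmax calls in the missing branch with draws from belief and from the sampled state's row of emitmat (for example with np.random.choice).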